
Wow, thanks for sharing. I didn't realize how closely related they were.

(TLDR For anyone wondering, "recent HN issues" means HN very likely went down yesterday because of this same bug, when two (edit: two pairs, four total) enterprise SSDs with old firmware died after 40,000 hours close together. An admin of HN and its host both like this theory. See details in that thread.)

Edit: If you want to discuss that theory, it's probably better to do it in that other thread directly instead... dang and a person from M5 Hosting (HN's previous host) are both participating there.



Not two SSDs, four: two in the main server, and two in the backup server.


Yowch. The old "stagger your drive replacements, stagger your batches" thing might not be quite as outdated as we'd like to think...


I have definitely seen RAID arrays where the drives were all part of a single manufacturing batch, and multiple drives all failed in rapid succession. I think this can be caused by several things:

- Unless you periodically do full-drive reads, you may silently accumulate bad blocks across multiple drives in an array. When you finally detect a failed drive, you discover that other drives have also been failing for months.

- A full RAID rebuild is a high-stress event that tries to read every disk block on every drive, as rapidly as possible.

- And finally, some drive batches are just dodgy, and it may not take much to push them over. And if identically dodgy drives are all exposed to exactly the same thermal stress and the same I/O operations, then I guess they might fail close together?

Honestly, RAID arrays only buy you so much reliability. Hardware RAID controllers are another single point of failure. I once lost two drives and a RAID controller all together during Christmas, which was not a fun time.

I do like the modern idea of S3-like storage, where data is replicated over several independent machines, and the controlling software can recover from losing entire servers (or even data centers). It's not a perfect match for everything, but it works great for lots of things.


You are spot on with everything, especially RAID controllers.

I used to help manage a large fleet of database servers. We found that blocks could "rot" on the underlying storage, yet if they were read often enough they would be held in memory for months and never re-read from the underlying drive. Until you rebooted!


Yes, bitrot is a huge problem with mechanical hard drives and media that hasn’t been read in a long time. What you write to the drive might not be what you read back five years later. That’s why ZFS is critical for systems like this: it keeps checksums for each block of data and can rebuild from parity if there is a mismatch.
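The checksum-plus-parity idea is easy to sketch. A toy model (not ZFS's actual on-disk format, which stores checksums in parent block pointers rather than beside the data):

```python
import hashlib

# Toy model of the ZFS idea: keep a checksum per block plus XOR parity
# across a stripe, and rebuild any block whose checksum no longer matches.
def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# A stripe of two equal-sized data blocks and their parity.
d0, d1 = b"hello world!", b"more data..."
parity = xor_blocks(d0, d1)
sums = [checksum(d0), checksum(d1)]

# Simulate bit rot: one flipped bit in d0's first byte.
rotten = bytes([d0[0] ^ 0x01]) + d0[1:]

# A scrub detects the mismatch and rebuilds the block from parity.
assert checksum(rotten) != sums[0]      # rot detected
rebuilt = xor_blocks(parity, d1)
assert checksum(rebuilt) == sums[0]     # original data recovered
```

This is also why periodic scrubs matter: the mismatch is only found when the block is actually re-read and re-checksummed.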


100% spot on about RAID. I actually had it happen once that a second disk failed under the load of rebuilding the array after the first disk failure (not related to the SSD issue).


ZFS is the last file system you will ever need


ZFS is probably the best filesystem tech of this generation. The best filesystem tech of the next generation... I suspect Ceph.


Yeah true, if ZFS could span multiple servers it would be perfect


I had no idea anyone thought this was outdated. We certainly never stopped doing it. I think it is a timeless failsafe.


Yeah, Backblaze and DigitalOcean both talk about it a bunch in their sysops stuff.


About 20 years ago I worked for a small storage company. The person who managed the returns of disks from customers was very strongly of the opinion that the odd firmware versions on Seagate drives were returned way more often than the even ones.


That's the kind of superstition that's only brought about by deep trauma.


Back then, working with Seagate drives was a trauma in itself. In 2008/2009, I set up more than 3000 1 TB Barracuda ES drives; 1800 of them failed in the following 3 years (they came with a 5-year warranty). I stopped keeping track of Barracuda failures at some point.

Unsurprisingly, 14 years later I still wouldn't recommend Seagate drives to anyone.


Thinking back, mixing firmware versions on new units was avoided where possible, and we would also try to replace like-for-like firmware on RMAs.


Plus you mitigate getting a bad batch of the same drive. See IBM Deathstars and the Seagate drives.

May have to go check the power-on hours on my drives now; I must have a few nearing that sort of number.


That's basically sysadmin scripture. Ignore it at your own peril.


Or, update the firmware on your switches, servers, drives, etc on a regular basis.


Thanks for the correction!


The chance of two SSDs failing at the same time under normal circumstances is extremely slim, so this bug might actually be the cause of this incident.


True. But it's more about the probability of things being "normal", isn't it? I had multiple 970 Evos fail within a very short time. It turned out to be a systematic problem with drives produced in one specific month.

Just how much difference is enough to be safe is the million-dollar question...


It seems more likely it was four drives (though dang and Mike both refer to "two" in the earlier thread).

Both primary and failover servers had RAID arrays. I suspect RAID 10 (striped mirror), which would mean two drives would have to fail to take down a single server.

Four drives of the same manufacturer spec and batch would do that.


This is what always worries me about our home server. It's running ZFS with multiple redundant drives, but the supplier refused (when I explicitly asked) to supply it with disks known to be from different batches, claiming that the odds of multiple failures close together were negligible. Obviously we have backups as well, but the time and cost to restore a full server from online backups can be significant.


Within an earlier thread "Tell HN: HN Moved from M5 to AWS", there's an excellent comment by loxias about risk diversification across multiple factors. Well worth reading:

https://news.ycombinator.com/item?id=32031655

I've increasingly come to view systems operations / SRE as a risk management exercise, where the goal is to reduce the odds of a catastrophic failure. Total system outage is one level; unrecoverable total system outage is even worse.

Having multiple redundant backups / storage systems, in different locations, with different vendor hardware / stacks, all helps reduce risk of a single-factor outage. Though complexity risk is its own issue.


> I've increasingly come to view systems operations / SRE as a risk management exercise, where the goal is to reduce the odds of a catastrophic failure.

s/reduce/find an appropriate level for/

It's a common misconception that risk management and risk reduction are synonyms. Risk management is about finding the right level of risk given external factors. Sometimes that means maintaining the current level of risk or even increasing it in favour of other properties.


My point is that risk is central to systems management. If you look at earlier standard texts on the subject, e.g., Nemeth or Frisch, the concept of risk is all but entirely missing. I have numerous disagreements with Google, but one place where I agree is that the term SRE, site reliability engineer, puts the notion of managing for stability front and centre, and inherently acknowledges the principle of risk. I've since heard from others that this is in fact how the practice is presented and taught there.

Quibbling over whether the proper term is risk management or risk reduction rather spectacularly misses the forest for the trees.


A friend of mine who was once at Google mentioned that in a meeting once their SRE group was told by their senior manager roughly "you've broken stuff in $time_period far less than your outage budget, and that probably means you should've been rolling out features faster".

Risk -levels- are a choice and also a trade-off.


Okay, fair enough. To me, the risk involvement is so obvious (to any serious business function!) that I find the management/reduction distinction a more important point. But I can see it your way too.


Thanks.

I don't know how long you've been in the business, but the change seems a relatively recent one, one that wasn't manifestly obvious to me, and one that has pretty much always seemed difficult to communicate to management.

Whether that's because business management is often about ignoring risks or treating it as inconvenient, or if I've just had a long string of bad bosses, I'm not sure.

I did make a point of looking through several of the books that were formative for me (mostly 1990s and 2000s publication dates), and there's little addressing the point. Limoncelli's book on time management for sysadmins was a notable departure from the standard when it came out, in 2008. I'd say that marked the shift toward structured and process-oriented practices.

That was about the time of the transition from "pets" to "cattle" (focus on individual servers vs. groups / farms), but pre-dates the cloud transition.


You know what? You're right again!

The stuff I've read that touches on this idea is almost all from 2006 and onwards, mainly 2010s. The earliest example is a bit of an outlier: Douglas Hubbard's 1985 How to Measure Anything -- but it's also only tangentially related.

The other real exceptions are books on statistics (where the idea of risk management -- at least in my collection -- seems to have gotten popular in the 1950s, probably as a result of the Second World War) and financial risk management (which seems to really have taken off in the 1980s, probably in conjunction with options becoming a thing). Statisticians and finance people (and by extension e.g. poker and bridge players) have known this stuff for a while.

Of course, hydrologists have been doing this stuff since the early 1900s at least, but extreme value theory has always been a kind of niche so I'm not sure I should count that.

----

That said, I did mention it was obvious to me. I still find it hard to convince management and colleagues of its importance...


> The stuff I've read that touches on this idea is almost all from 2006 and onwards, mainly 2010s. The earliest example is a bit of an outlier: Douglas Hubbard's 1985 How to Measure Anything -- but it's also only tangentially related.

I didn’t keep good track of such things but a lot of my early reading in the 70’s was in operations research and decision support systems, mostly sort of what we call operational analytics these days with a big helping of statistical process control too. World War 2 logistics practices and ‘50s and ‘60s “scientific management” fads generated a lot of material, some insightful. Many medium-sized businesses could afford significant R&D then, so you’ll find e.g. furniture factories developing their own computer systems from PCBs to custom ASIC components, just to manage statistical process control and decision support systems.

> That said, I did mention it was obvious to me. I still find it hard to convince management and colleagues of its importance...

I think the reason I keep having to justify this every few years is the tendency towards abstractions in management which try to simplify things into “anecdotal analytics”, e.g. preferring a persuasive narrative over reality...for a good cynical perspective from the ‘50s I recommend C.M. Kornbluth’s “The Marching Morons” (<https://en.wikipedia.org/wiki/The_Marching_Morons>).


> It's a common misconception that risk management and risk reduction are synonyms. Risk management is about finding the right level of risk given external factors. Sometimes that means maintaining the current level of risk or even increasing it in favour of other properties.

What’s funny is I seem to have to explain this to senior management anew every 6-7 years. I know they teach it in management school, but in the real world somehow people fall into the false equivalence when they get promoted. Often they adopt a cartoonish view of things because they can’t get the quantitative signals, and every decision effectively reduces to what I ironically term anecdotal analytics.

I have this amusing heuristic for risk acceptance which I often use to help people approach decisions: you should kick the decision up to someone with higher authority if your signing authority is less than: risk coefficient times quantified exposure, less mitigation cost where mitigation is within signing authority AND/OR budgeted and authorized spend. I like to view mitigation and opportunity cost/benefit in a similar way so I have some idea of equivalences when evaluating tradeoffs.
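That heuristic is easy to write down. A minimal sketch (the function name and example numbers are mine, purely illustrative):

```python
# Escalate when signing authority < risk coefficient * quantified exposure,
# less mitigation cost -- but only count mitigation that is within signing
# authority and/or already budgeted and authorized.
def should_escalate(signing_authority: float,
                    risk_coefficient: float,
                    exposure: float,
                    mitigation_cost: float = 0.0,
                    mitigation_covered: bool = False) -> bool:
    offset = mitigation_cost if mitigation_covered else 0.0
    return signing_authority < risk_coefficient * exposure - offset

# e.g. a 10% chance of a $1M loss with $20k of budgeted mitigation:
print(should_escalate(50_000, 0.10, 1_000_000, 20_000, True))   # True: $50k < $80k
print(should_escalate(100_000, 0.10, 1_000_000, 20_000, True))  # False
```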

It’s not original with me, I must have lifted it from some decades-past HBR article or 60’s rant on quantitative business management.

I could rant on various aspects of risk management application all day, thank goodness I've managed to quit before I really got started sharing...in my experience it’s been very helpful when applied in real world engineering implementations.


One of the advantages startups have is higher risk tolerance than branded megacorporations: data, customers, brand, employees, lawsuits, or, cynically, human lives.


The risk-tolerance of startups is illusory.

The risk is being managed at the VC/investor level, by diversifying investment bets over numerous early ventures.

The death of any one of those isn't a concern for the VC, if the portfolio performance is sufficient. Of course, for the individual venture and employees, that risk is disaggregated.

More rigorous systems practices are seen as an impediment to early growth with any potential problems either something that can be ironed out later, or simply a post-liquidation concern that doesn't factor into the investors' interests at all.


The risk tolerance is the result of not potentially taking out $1T of additional value if you f-- up a small project at a mega corporation.

At a mega corporation, you can't take risks which would make complete sense for the project you're on in isolation, because of the uncapped liability for the mothership.

There are misalignments like the one you describe but that's not what I'm talking about.


> Total system outage is one level, unrecoverable total system outage is even worse.

Ha, I used to suggest people consider “total failure of business” in exposure quantification...


My solution is to use a different manufacturer for each drive in a mirror. The prices are usually pretty similar and you get to make sure that one firmware bug doesn't kill your entire pool.


This is the way.

For even more peace of mind (and only when you can afford it, obviously), try decoupling your disk purchases a bit from when you're going to need them.

When you see a good price or a sale on a particular disk, grab it and add it to your own personal "prebought disk pool". When it's time to either replace a disk or spin up a whole new array, you now have the benefit of diversification across time.


My current procedure is to keep an external drive for backups, and if any of the drives in my RAID fails, I'll just shuck the external and stick it in the RAID. The advantage is that the drive is already known to be good through running badblocks (which takes like a week to run), and I don't need to wait a week for the Amazon man to get here. The disadvantage is that I need to recreate my backup from scratch, which loses my version history, or restore it from an online copy, which is slow and cumbersome.


I do the same. Ages ago I had to build a server at work and picked three vendors for the RAID 5. Got funny looks from coworkers; apparently they found the idea super strange. One drive (Seagate, of course) failed after a year, and since we had a matching-size WD lying around, we used that. Now there were two WDs in the setup. Some years later the PSU blew up and killed both WDs; the Toshiba survived.


I decided to employ this tactic when I was setting up my new NAS and needed two drives.

Upside was that I could definitely know they weren't from the same batch.

Downside was that I had to buy a Seagate, and I don't have good experiences with Seagate since my only Seagate drive had died an early death at the tender age of 3 years. Turns out that this was very much a downside, since the new Seagate drive died at the tender age of 16 months.


I had a series of WD drives that failed, and I managed to get them all replaced under warranty since they died within 3-5 years. I don't buy WD drives anymore but it wasn't the end of the world since I had spares while I waited for the replacement drives to be shipped.

Anecdotally, I haven't had issues with Seagate but I'm sure it really boils down to which exact drives you're using and what batch they were in.


Pick a manufacturer and you'll be able to find plenty of horror stories.

Some may be worse than others but diversification is the right answer anyway.


But how do you color match your drives in your spiffy NAS?

This is why my bicycle drivetrain should be a frankenstein combination of parts from different manufacturers?

/s


I hate absolutely everything about this comment.

If we're ever both at the same conference show me a link to this and I'll buy the first round.


> But how do you color match your drives in your spiffy NAS?

haha, pimp your ride!


> I had a series of WD drives that failed, and I managed to get them all replaced under warranty since they died within 3-5 years.

I find hard drive warranties to be mostly an illusion. It's better now that full disk encryption is becoming better supported and potentially available on personal devices, not just corporate ones managed by IT professionals. However, until recently the number of drives I've had in any personal/home system that I would have returned under warranty instead of securely destroying to prevent the risk of data leakage was zero. The number of phones I have ever traded in is similarly zero. It's horribly wasteful, but until there are cast-iron guarantees that all the private data we keep on these devices is going to be securely deleted, it's the only sane policy IMHO (apart from never using these devices for anything remotely sensitive in the first place, but that's all but impossible in modern society).


All of my drives have FDE. I wouldn't have shipped the drives if that weren't the case (also, luckily, the issue was that writes only failed on part of the disk, so I could wipe the LUKS metadata section).


Possibly the supplier was talking about hardware failures.

The issue here (as it was several years ago with the well-known Seagate 7200.11 issue [0]) is not about the odds of multiple (hardware) failures together (which may actually be a very rare case); in this case it is essentially a software failure, a counter that crashes the on-disk operating system (if we can call it that), be it an overflow of the counter or hitting a certain value.

The chances of having almost simultaneous failures are near certainty for drives that have been booted the same number of times and powered on for the same number of hours, if the affected counters are related to these events.

[0] Some reference:

https://msfn.org/board/topic/128807-the-solution-for-seagate...

>Root Cause

This condition was introduced by a firmware issue that sets the drive event log to an invalid location causing the drive to become inaccessible.

The firmware issue is that the end boundary of the event log circular buffer (320) was set incorrectly. During Event Log initialization, the boundary condition that defines the end of the Event Log is off by one. During power up, if the Event Log counter is at entry 320, or a multiple of (320 + x*256), and if a particular data pattern (dependent on the type of tester used during the drive manufacturing test process) had been present in the reserved-area system tracks when the drive's reserved-area file system was created during manufacturing, firmware will increment the Event Log pointer past the end of the event log data structure. This error is detected and results in an "Assert Failure", which causes the drive to hang as a failsafe measure. When the drive enters failsafe, further updates to the counter become impossible and the condition will remain through subsequent power cycles. The problem only arises if a power cycle initialization occurs when the Event Log is at 320 or some multiple of 256 thereafter. Once a drive is in this state, there is no path to resolve/recover existing failed drives without Seagate technical intervention. For a drive to be susceptible to this issue, it must have both the firmware that contains the issue and have been tested through the specific manufacturing process.
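The failure mode described above is a classic off-by-one on a circular buffer. A toy reconstruction of the shape of the bug (not Seagate's actual firmware):

```python
LOG_SIZE = 320                      # event log slots, valid indexes 0..319
event_log = [None] * LOG_SIZE

def advance_buggy(pointer: int) -> int:
    # The end boundary is off by one, so a pointer landing exactly on
    # LOG_SIZE (one past the last valid index) is accepted instead of wrapped.
    next_ptr = pointer + 1
    if next_ptr > LOG_SIZE:         # BUG: should be `>= LOG_SIZE`
        next_ptr = 0                # wrap the circular buffer
    return next_ptr

# Power-up with the counter at the boundary: the pointer walks one slot
# past the end of the buffer, and the write trips the failsafe.
ptr = advance_buggy(319)            # -> 320, one past the last valid index
failsafe = False
try:
    event_log[ptr] = "power-on"     # the firmware's "Assert Failure"
except IndexError:
    failsafe = True                 # drive hangs; persists across power cycles
assert failsafe
```

Every other counter value wraps or advances normally, which is why the drives work fine for tens of thousands of hours and then all brick on the same boot.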


If you’re willing to wait, you can always order half, wait a month or two, then order the other half.


Unfortunately this particular server was a replacement for another that had failed suddenly so that wasn't really an option in my case. If it had been one of many at work then it would have been a sensible option, I agree.


Repurposing several of the initial drives with later purchases (or exchanges) might be another option.


If it's really 4 drives, bought at the same time, failing simultaneously, that's pretty damning evidence.

Reminds me of my laptop that I bought with 2 SSDs. Not from HP or Dell, but still, I now wonder if I should replace one of them with a more recent SSD and give the other to my son (he currently has an anemic SSD that's too small to install Genshin Impact on).


Assuming you have the budget for it, a worst-case scenario of "I made my son happy while achieving nothing -else-" doesn't strike me as terrible at all.


It costs money and there's no guarantee it will actually make him happy. It could lead to him playing the game in some dark corner where no one can find him. There are advantages to him having to use the desktop PC.


Especially since one pair was a nearly unused backup server that had a totally different use profile.


If both SSDs are from the same lot number and one fails, the chances of the second failing go up substantially. Both failing at the same time, though, is extremely rare.
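That intuition can be made concrete with a toy mixture model (all numbers are assumptions for illustration, not measured failure rates):

```python
# A fraction f of batches carry a latent defect; drives from a defective
# batch fail within a year with probability q, healthy ones with p.
f, p, q = 0.01, 0.02, 0.80

# Marginal probability that a single drive fails this year.
p_fail = f * q + (1 - f) * p

# Two drives from the SAME batch share the batch's fate, so failures
# are correlated: condition on the batch when computing the joint.
p_both = f * q * q + (1 - f) * p * p
p_second_given_first = p_both / p_fail

print(round(p_fail, 4))                  # 0.0278
print(round(p_second_given_first, 3))    # 0.244 -- vs. a ~2% base rate
```

Under these assumed numbers, one failure raises the odds of the sibling failing by an order of magnitude, even though the joint probability of both failing in any given year stays small.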


We (as an industry) went through this bad batch madness with the IBM DeskStar 75GXP hard drives, which were affectionately referred to as "IBM Deathstar"[1].

It's rare, but it's not _that_ rare. You have to make the effort to understand why it failed.

I had a situation where I deployed Toshiba SLC SSDs (that were purchased over the course of several months) and a piece of software that synchronized to disk frequently, resulting in about 1GB of writes per hour.

After ~11 months in service, most of the drives died in the same 4 week period. We were astounded that everything failed so close to each other, including instances where both drives in a RAID 1 set were toast.

We did extensive troubleshooting between the failed servers and the remaining servers and figured out that write volume (by proxy of in-service date) was the one predictor of failure. Shortly thereafter, wear leveling and TRIM became things we sought out mentions of when spec'ing out hardware.

1: https://en.wikipedia.org/wiki/Deskstar


There was also more recently the case of the Seagate 7200.11, see my previous comment:

https://news.ycombinator.com/item?id=32053477


It usually takes something really bad happening before things get better.

The absolute best mechanical drives available until quite recently can be traced back to the Deathstar. The Deskstar 7K4000 was absolute best in class.

https://en.wikipedia.org/wiki/Deskstar

Hitachi bought IBM's hard drive business in 2003 for $2B. Sadly, it's now owned by WD.


Of the ~20 assorted Deskstar and related Ultrastar drives of that vintage I've had in regular use over the past decade, exactly one has failed...because I dropped it.

HGST SAS SSDs (the ones that pair Intel NAND and Hitachi SAS controllers) have also been reliable performers in my home office experience, even without (RAID controller support for) UNMAP (SCSI "TRIM"). Incidentally, these now appear to be selling on eBay for more than I paid for them "lightly used" several years ago.



Back then people still used Slashdot:

Wondering if it's real... https://m.slashdot.org/story/20680

Years later, it's a widespread phenomenon: https://m.slashdot.org/story/43312

It was affectionately called the "click of death".


>But don't we all love them now because they support linux?

Ah, inventing an opinion to get angry about: some things never change.


Why would there be submissions? These drives predate HN.


HN occasionally discusses issues pre-dating itself.


... and the lifetime (deathtime?) award for the Deathstar only predated HN by a few months:

May 26, 2006: https://www.pcworld.com/article/535838/worst_products_ever.h...

October 9, 2006: https://news.ycombinator.com/item?id=1

Memory would still have been reasonably green.


It does however come up in comments regularly. I know I've brought it up more than once, because I had a week long ordeal replacing all the drives in an array as they died one by one back in the day.


The Deathstars were fantastic: they almost always failed on the outer edges of the platters.

So if you only formatted them (filesystem-wise) out to capacity minus 2 GB, they were a really cheap option at the time.


That’s a very different definition of fantastic than the one I use.


Once I'd worked it out - which was after the problems were public and therefore the price had utterly cratered - they were by far the cheapest storage per GB available at the time (think "by a factor of two").

I would not have let a normal business user near one, but the developers I was supporting were most pleased about their larger than expected scratch disks for test databases and intermediate compilation artifacts.

Everything breaks. Things that at least break predictably make me happier than the alternative.


The way I always put it is: you have identical drives, with identical usage, powered for identical times, and you are still surprised when a second drive fails under the high-stress environment of rebuilding after the first drive fails.


Heh, no. We had a fleet of HPE Cloudline (CL3100) servers failing at the same time because the SSDs had exhausted their write endurance.


Perhaps they were the same model. IIRC people recommend not using the same model of hardware to provide redundancy.


I generally replace HDDs in my personal zpool a few days apart, for this reason. I also order them from different suppliers, so I can get different manufacture dates.



