
Wow, thanks for sharing. I didn't realize how closely related they were.

(TLDR For anyone wondering, "recent HN issues" means HN very likely went down yesterday because of this same bug, when two (edit: two pairs, four total) enterprise SSDs with old firmware died after 40,000 hours close together. An admin of HN and its host both like this theory. See details in that thread.)

Edit: If you want to discuss that theory, it's probably better to do it in that other thread directly instead... dang and a person from M5 Hosting (HN's previous host) are both participating there.



Not two SSDs, four: two in the main server, and two in the backup server.


Yowch. The old "stagger your drive replacements, stagger your batches" thing might not be quite as outdated as we'd like to think...


I have definitely seen RAID arrays where the drives were all part of a single manufacturing batch, and multiple drives all failed in rapid succession. I think this can be caused by several things:

- Unless you periodically do full-drive reads, you may silently accumulate bad blocks across multiple drives in an array. When you finally detect a failed drive, you discover that other drives have also been failing for months.

- A full RAID rebuild is a high-stress event that tries to read every disk block on every drive, as rapidly as possible.

- And finally, some drive batches are just dodgy, and it may not take much to push them over. And if identically dodgy drives are all exposed to exactly the same thermal stress and the same I/O operations, then I guess they might fail close together?

Honestly, RAID arrays only buy you so much reliability. Hardware RAID controllers are another single point of failure. I once lost two drives and a RAID controller all together during Christmas, which was not a fun time.

I do like the modern idea of S3-like storage, where data is replicated over several independent machines, and the controlling software can recover from losing entire servers (or even data centers). It's not a perfect match for everything, but it works great for lots of things.


You are spot on with everything, especially RAID controllers.

I used to help manage a large fleet of database servers. We found that blocks could "rot" on the underlying storage, yet if they were read often enough they would be held in memory for months and never re-read from the underlying drive. Until you rebooted!


Yes, bitrot is a huge problem with mechanical hard drives and media that hasn’t been read in a long time. What you write to the drive might not be what you read back five years later. That’s why ZFS is critical for systems like this: it keeps checksums for each block of data and can rebuild from parity if there is a mismatch.
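The checksum-plus-parity idea is easy to sketch. A toy model (not ZFS's actual on-disk format, which stores checksums in parent block pointers rather than beside the data):

```python
import hashlib

# Toy model of the ZFS idea: keep a checksum per block plus XOR parity
# across a stripe, and rebuild any block whose checksum no longer matches.
def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# A stripe of two equal-sized data blocks and their parity.
d0, d1 = b"hello world!", b"more data..."
parity = xor_blocks(d0, d1)
sums = [checksum(d0), checksum(d1)]

# Simulate bit rot: one flipped bit in d0's first byte.
rotten = bytes([d0[0] ^ 0x01]) + d0[1:]

# A scrub detects the mismatch and rebuilds the block from parity.
assert checksum(rotten) != sums[0]      # rot detected
rebuilt = xor_blocks(parity, d1)
assert checksum(rebuilt) == sums[0]     # original data recovered
```

This is also why periodic scrubs matter: the mismatch is only found when the block is actually re-read and re-checksummed.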


100% spot on about RAID. I actually had it happen once that a second disk failed under the load of rebuilding the array after the first disk failure (not related to the SSD issue).


ZFS is the last file system you will ever need


ZFS is probably the best filesystem tech of this generation. The best filesystem tech of the next generation... I suspect Ceph.


Yeah true, if ZFS could span multiple servers it would be perfect


I had no idea anyone thought this was outdated. We certainly never stopped doing it. I think it is a timeless failsafe.


Yeah, Backblaze and DigitalOcean both talk about it a bunch in their sysops stuff.


About 20 years ago I worked for a small storage company. The person who managed the returns of disks from customers was very strongly of the opinion that the odd firmware versions on Seagate drives were returned way more often than the even ones.


That's the kind of superstition that's only brought about by deep trauma.


Back then, working with Seagate drives was a trauma in itself. In 2008/2009, I set up more than 3000 1 TB Barracuda ES drives; 1800 of them failed in the following 3 years (they came with a 5-year warranty). I stopped keeping track of Barracuda failures at some point.

Unsurprisingly, 14 years later I still wouldn't recommend Seagate drives to anyone.


Thinking back, mixing firmware versions on new units was avoided where possible, and we would also try to replace like-for-like firmware on RMAs.


Plus you mitigate getting a bad batch of the same drive. See IBM Deathstars and the Seagate drives.

May have to go check the power-on hours on my drives now; I must have a few nearing that sort of number.


That's basically sysadmin scripture. Ignore it at your own peril.


Or, update the firmware on your switches, servers, drives, etc on a regular basis.


Thanks for the correction!


The chance of two SSDs failing at the same time under normal circumstances is extremely slim, so this bug might actually be the cause of this incident.


True. But it's more about the probability of things being "normal", isn't it? I had multiple 970 Evos fail within a very short time. It turned out to be a systematic problem with drives produced in one specific month.

Just how much difference is enough to be safe is the million-dollar question...


It seems more likely it was four drives (though dang and Mike both refer to "two" in the earlier thread).

Both primary and failover servers had RAID arrays. I suspect RAID 10 (striped mirror), which would mean two drives would have to fail to take down a single server.

Four drives of the same manufacturer spec and batch would do that.


This is what always worries me about our home server. It's running ZFS with multiple redundant drives, but the supplier refused (when I explicitly asked) to supply it with disks known to be from different batches, claiming that the odds of multiple failures close together were negligible. Obviously we have backups as well, but the time and cost to restore a full server from online backups can be significant.


Within an earlier thread "Tell HN: HN Moved from M5 to AWS", there's an excellent comment by loxias about risk diversification across multiple factors. Well worth reading:

https://news.ycombinator.com/item?id=32031655

I've increasingly come to view systems operations / SRE as a risk management exercise, where the goal is to reduce the odds of a catastrophic failure. Total system outage is one level; unrecoverable total system outage is even worse.

Having multiple redundant backups / storage systems, in different locations, with different vendor hardware / stacks, all helps reduce risk of a single-factor outage. Though complexity risk is its own issue.


> I've increasingly come to view systems operations / SRE as a risk management exercise, where the goal is to reduce the odds of a catastrophic failure.

s/reduce/find an appropriate level for/

It's a common misconception that risk management and risk reduction are synonyms. Risk management is about finding the right level of risk given external factors. Sometimes that means maintaining the current level of risk or even increasing it in favour of other properties.


My point is that risk is central to systems management. If you look at earlier standard texts on the subject, e.g., Nemeth or Frisch, the concept of risk is all but entirely missing. I have numerous disagreements with Google, but one place where I agree is that the term SRE, site reliability engineer, puts the notion of managing for stability front and centre, and inherently acknowledges the principle of risk. I've since heard from others that this is in fact how the practice is presented and taught there.

Quibbling over whether the proper term is risk management or risk reduction rather spectacularly misses the forest for the trees.


A friend of mine who was once at Google mentioned that in a meeting once their SRE group was told by their senior manager roughly "you've broken stuff in $time_period far less than your outage budget, and that probably means you should've been rolling out features faster".

Risk -levels- are a choice and also a trade-off.


Okay, fair enough. To me, the risk involvement is so obvious (to any serious business function!) that I find the management/reduction distinction a more important point. But I can see it your way too.


Thanks.

I don't know how long you've been in the business, but the change seems a relatively recent one, one that wasn't manifestly obvious to me, and one that has pretty much always seemed difficult to communicate to management.

Whether that's because business management is often about ignoring risks or treating it as inconvenient, or if I've just had a long string of bad bosses, I'm not sure.

I did make a point of looking through several of the books that were formative for me (mostly 1990s and 2000s publication dates), and there's little addressing the point. Limoncelli's book on time management for sysadmins was a notable departure from the standard when it came out, in 2008. I'd say that marked the shift toward structured and process-oriented practices.

That was about the time of the transition from "pets" to "cattle" (focus on individual servers vs. groups / farms), but pre-dates the cloud transition.


You know what? You're right again!

The stuff I've read that touches on this idea is almost all from 2006 and onwards, mainly 2010s. The earliest example is a bit of an outlier: Douglas Hubbard's 1985 How to Measure Anything -- but it's also only tangentially related.

The other real exceptions are books on statistics (where the idea of risk management -- at least in my collection -- seems to have gotten popular in the 1950s, probably as a result of the Second World War) and financial risk management (which seems to really have taken off in the 1980s, probably in conjunction with options becoming a thing). Statisticians and finance people (and by extension e.g. poker and bridge players) have known this stuff for a while.

Of course, hydrologists have been doing this stuff since the early 1900s at least, but extreme value theory has always been a kind of niche so I'm not sure I should count that.

----

That said, I did mention it was obvious to me. I still find it hard to convince management and colleagues of its importance...


> The stuff I've read that touches on this idea is almost all from 2006 and onwards, mainly 2010s. The earliest example is a bit of an outlier: Douglas Hubbard's 1985 How to Measure Anything -- but it's also only tangentially related.

I didn’t keep good track of such things but a lot of my early reading in the 70’s was in operations research and decision support systems, mostly sort of what we call operational analytics these days with a big helping of statistical process control too. World War 2 logistics practices and ‘50s and ‘60s “scientific management” fads generated a lot of material, some insightful. Many medium-sized businesses could afford significant R&D then, so you’ll find e.g. furniture factories developing their own computer systems from PCBs to custom ASIC components, just to manage statistical process control and decision support systems.

> That said, I did mention it was obvious to me. I still find it hard to convince management and colleagues of its importance...

I think the reason I keep having to justify this every few years is the tendency towards abstractions in management which try to simplify things into “anecdotal analytics”, e.g. preferring a persuasive narrative over reality...for a good cynical perspective from the ‘50s I recommend C.M. Kornbluth’s “The Marching Morons” (<https://en.wikipedia.org/wiki/The_Marching_Morons>).


> It's a common misconception that risk management and risk reduction are synonyms. Risk management is about finding the right level of risk given external factors. Sometimes that means maintaining the current level of risk or even increasing it in favour of other properties.

What’s funny is I seem to have to explain this to senior management anew every 6-7 years. I know they teach it in management school, but in the real world somehow people fall into the false equivalence when they get promoted. Often they adopt a cartoonish view of things because they can’t get the quantitative signals, and every decision effectively reduces to what I ironically term anecdotal analytics.

I have this amusing heuristic for risk acceptance which I often use to help people approach decisions: you should kick the decision up to someone with higher authority if your signing authority is less than: risk coefficient times quantified exposure, less mitigation cost where mitigation is within signing authority AND/OR budgeted and authorized spend. I like to view mitigation and opportunity cost/benefit in a similar way so I have some idea of equivalences when evaluating tradeoffs.
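That heuristic is easy to write down. A minimal sketch (the function name and example numbers are mine, purely illustrative):

```python
# Escalate when signing authority < risk coefficient * quantified exposure,
# less mitigation cost -- but only count mitigation that is within signing
# authority and/or already budgeted and authorized.
def should_escalate(signing_authority: float,
                    risk_coefficient: float,
                    exposure: float,
                    mitigation_cost: float = 0.0,
                    mitigation_covered: bool = False) -> bool:
    offset = mitigation_cost if mitigation_covered else 0.0
    return signing_authority < risk_coefficient * exposure - offset

# e.g. a 10% chance of a $1M loss with $20k of budgeted mitigation:
print(should_escalate(50_000, 0.10, 1_000_000, 20_000, True))   # True: $50k < $80k
print(should_escalate(100_000, 0.10, 1_000_000, 20_000, True))  # False
```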

It’s not original with me, I must have lifted it from some decades-past HBR article or 60’s rant on quantitative business management.

I could rant on various aspects of risk management application all day, thank goodness I've managed to quit before I really got started sharing...in my experience it’s been very helpful when applied in real world engineering implementations.


One of the advantages startups have is higher risk tolerance than branded megacorporations: data, customers, brand, employees, lawsuits, or, cynically, human lives.


The risk-tolerance of startups is illusory.

The risk is being managed at the VC/investor level, by diversifying investment bets over numerous early ventures.

The death of any one of those isn't a concern for the VC, if the portfolio performance is sufficient. Of course, for the individual venture and employees, that risk is disaggregated.

More rigorous systems practices are seen as an impediment to early growth with any potential problems either something that can be ironed out later, or simply a post-liquidation concern that doesn't factor into the investors' interests at all.


The risk tolerance is the result of not potentially taking out $1T of additional value if you f-- up a small project at a mega corporation.

At a mega corporation, you can't take risks which would make complete sense for the project you're on in isolation, because of the uncapped liability for the mothership.

There are misalignments like the one you describe but that's not what I'm talking about.


> Total system outage is one level, unrecoverable total system outage is even worse.

Ha, I used to suggest people consider “total failure of business” in exposure quantification...


My solution is to use a different manufacturer for each drive in a mirror. The prices are usually pretty similar and you get to make sure that one firmware bug doesn't kill your entire pool.


This is the way.

For even more peace of mind (and only when you can afford it, obviously), try decoupling your disk purchases a bit from when you're going to need them.

When you see a good price or a sale on a particular disk, grab it and add it to your own personal "prebought disk pool". When it's time to either replace a disk or spin up a whole new array, you now have the benefit of diversification across time.


My current procedure is to keep an external drive for backups, and if any of the drives in my RAID fails, I'll just shuck the external and stick it in the RAID. The advantage is that the drive is already known to be good through running badblocks (which takes like a week to run), and I don't need to wait a week for the Amazon man to get here. The disadvantage is that I need to recreate my backup from scratch, which loses my version history, or restore it from an online copy, which is slow and cumbersome.


I do the same. Ages ago I had to build a server at work and picked three vendors for the RAID 5. Got funny looks from coworkers; apparently they found the idea super strange. One drive (Seagate, of course) failed after a year, and since we had a matching-size WD lying around, we used that. Now there were two WDs in the setup. Some years later the PSU blew up and killed both WDs; the Toshiba survived.


I decided to employ this tactic when I was setting up my new NAS and needed two drives.

Upside was that I could definitely know they weren't from the same batch.

Downside was that I had to buy a Seagate, and I don't have good experiences with Seagate since my only Seagate drive had died an early death at the tender age of 3 years. Turns out that this was very much a downside, since the new Seagate drive died at the tender age of 16 months.


I had a series of WD drives that failed, and I managed to get them all replaced under warranty since they died within 3-5 years. I don't buy WD drives anymore but it wasn't the end of the world since I had spares while I waited for the replacement drives to be shipped.

Anecdotally, I haven't had issues with Seagate but I'm sure it really boils down to which exact drives you're using and what batch they were in.


Pick a manufacturer and you'll be able to find plenty of horror stories.

Some may be worse than others but diversification is the right answer anyway.


But how do you color match your drives in your spiffy NAS?

This is why my bicycle drivetrain should be a frankenstein combination of parts from different manufacturers?

/s


I hate absolutely everything about this comment.

If we're ever both at the same conference show me a link to this and I'll buy the first round.


> But how do you color match your drives in your spiffy NAS?

haha, pimp your ride!


> I had a series of WD drives that failed, and I managed to get them all replaced under warranty since they died within 3-5 years.

I find hard drive warranties to be mostly an illusion. It's better now that full disk encryption is becoming better supported and potentially available on personal devices, not just corporate ones managed by IT professionals. However, until recently the number of drives I've had in any personal/home system that I would have returned under warranty instead of securely destroying to prevent the risk of data leakage was zero. The number of phones I have ever traded in is similarly zero. It's horribly wasteful, but until there are cast-iron guarantees that all the private data we keep on these devices is going to be securely deleted, it's the only sane policy IMHO (apart from never using these devices for anything remotely sensitive in the first place, but that's all but impossible in modern society).


All of my drives have FDE. I wouldn't have shipped the drives if that weren't the case (also, luckily, the issue was that writes only failed on part of the disk, so I could wipe the LUKS metadata section).


Possibly the supplier was talking about hardware failures.

The issue here (as it was several years ago with the well-known Seagate 7200.11 issue [0]) is not about the odds of multiple (hardware) failures together (which may actually be a very rare case); in this case it is essentially a software failure, a counter that crashes the on-disk operating system (if we can call it that), be it an overflow of the counter or hitting a certain value.

The chances of having almost simultaneous failures are near certainty for drives that have been booted the same number of times and powered on for the same number of hours, if the affected counters are related to these events.

[0] Some reference:

https://msfn.org/board/topic/128807-the-solution-for-seagate...

>Root Cause

This condition was introduced by a firmware issue that sets the drive event log to an invalid location causing the drive to become inaccessible.

The firmware issue is that the end boundary of the event log circular buffer (320) was set incorrectly. During Event Log initialization, the boundary condition that defines the end of the Event Log is off by one. During power up, if the Event Log counter is at entry 320, or a multiple of (320 + x*256), and if a particular data pattern (dependent on the type of tester used during the drive manufacturing test process) had been present in the reserved-area system tracks when the drive's reserved-area file system was created during manufacturing, firmware will increment the Event Log pointer past the end of the event log data structure. This error is detected and results in an "Assert Failure", which causes the drive to hang as a failsafe measure. When the drive enters failsafe, further updates to the counter become impossible and the condition will remain through subsequent power cycles. The problem only arises if a power cycle initialization occurs when the Event Log is at 320 or some multiple of 256 thereafter. Once a drive is in this state, there is no path to resolve/recover existing failed drives without Seagate technical intervention. For a drive to be susceptible to this issue, it must have both the firmware that contains the issue and have been tested through the specific manufacturing process.
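The failure mode described above is a classic off-by-one on a circular buffer. A toy reconstruction of the shape of the bug (not Seagate's actual firmware):

```python
LOG_SIZE = 320                      # event log slots, valid indexes 0..319
event_log = [None] * LOG_SIZE

def advance_buggy(pointer: int) -> int:
    # The end boundary is off by one, so a pointer landing exactly on
    # LOG_SIZE (one past the last valid index) is accepted instead of wrapped.
    next_ptr = pointer + 1
    if next_ptr > LOG_SIZE:         # BUG: should be `>= LOG_SIZE`
        next_ptr = 0                # wrap the circular buffer
    return next_ptr

# Power-up with the counter at the boundary: the pointer walks one slot
# past the end of the buffer, and the write trips the failsafe.
ptr = advance_buggy(319)            # -> 320, one past the last valid index
failsafe = False
try:
    event_log[ptr] = "power-on"     # the firmware's "Assert Failure"
except IndexError:
    failsafe = True                 # drive hangs; persists across power cycles
assert failsafe
```

Every other counter value wraps or advances normally, which is why the drives work fine for tens of thousands of hours and then all brick on the same boot.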


If you’re willing to wait, you can always order half, wait a month or two, then order the other half.


Unfortunately this particular server was a replacement for another that had failed suddenly so that wasn't really an option in my case. If it had been one of many at work then it would have been a sensible option, I agree.


Repurposing several of the initial drives with later purchases (or exchanges) might be another option.


If it's really 4 drives, bought at the same time, failing simultaneously, that's pretty damning evidence.

Reminds me of my laptop that I bought with 2 SSDs. Not from HP or Dell, but still, I now wonder if I should replace one of them with a more recent SSD and give the other to my son (he currently has an anemic SSD that's too small to install Genshin Impact on).


Assuming you have the budget for it, a worst-case scenario of "I made my son happy while achieving nothing -else-" doesn't strike me as terrible at all.


It costs money and there's no guarantee it will actually make him happy. It could lead to him playing the game in some dark corner where no one can find him. There are advantages to him having to use the desktop PC.


Especially since one pair was a nearly unused backup server that had a totally different use profile.


If both SSDs are from the same lot number and one fails, the chances of the second failing go up substantially. Both failing at the same time, though, is extremely rare.
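That intuition can be made concrete with a toy mixture model (all numbers are assumptions for illustration, not measured failure rates):

```python
# A fraction f of batches carry a latent defect; drives from a defective
# batch fail within a year with probability q, healthy ones with p.
f, p, q = 0.01, 0.02, 0.80

# Marginal probability that a single drive fails this year.
p_fail = f * q + (1 - f) * p

# Two drives from the SAME batch share the batch's fate, so failures
# are correlated: condition on the batch when computing the joint.
p_both = f * q * q + (1 - f) * p * p
p_second_given_first = p_both / p_fail

print(round(p_fail, 4))                  # 0.0278
print(round(p_second_given_first, 3))    # 0.244 -- vs. a ~2% base rate
```

Under these assumed numbers, one failure raises the odds of the sibling failing by an order of magnitude, even though the joint probability of both failing in any given year stays small.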


We (as an industry) went through this bad batch madness with the IBM DeskStar 75GXP hard drives, which were affectionately referred to as "IBM Deathstar"[1].

It's rare, but it's not _that_ rare. You have to make the effort to understand why it failed.

I had a situation where I deployed Toshiba SLC SSDs (that were purchased over the course of several months) and a piece of software that synchronized to disk frequently, resulting in about 1GB of writes per hour.

After ~11 months in service, most of the drives died in the same 4 week period. We were astounded that everything failed so close to each other, including instances where both drives in a RAID 1 set were toast.

We did extensive troubleshooting between the failed servers and the remaining servers and figured out that write volume (by proxy of in-service date) was the one predictor of failure. Shortly thereafter, wear leveling and TRIM became things we sought out mentions of when spec'ing out hardware.

1: https://en.wikipedia.org/wiki/Deskstar


There was also more recently the case of the Seagate 7200.11, see my previous comment:

https://news.ycombinator.com/item?id=32053477


It usually takes something really bad happening before things get better.

The absolute best mechanical drives available until quite recently can be traced back to the Deathstar. The Deskstar 7K4000 was absolute best in class.

https://en.wikipedia.org/wiki/Deskstar

Hitachi bought IBM's hard drive business in 2003 for $2B. Sadly, it's now owned by WD.


Of the ~20 assorted Deskstar and related Ultrastar drives of that vintage I've had in regular use over the past decade, exactly one has failed...because I dropped it.

HGST SAS SSDs (the ones that pair Intel NAND and Hitachi SAS controllers) have also been reliable performers in my home office experience, even without (RAID controller support for) UNMAP (SCSI "TRIM"). Incidentally, these now appear to be selling on eBay for more than I paid for them "lightly used" several years ago.



Back then people still used Slashdot:

Wondering if it's real... https://m.slashdot.org/story/20680

Years later, it's a widespread phenomenon: https://m.slashdot.org/story/43312

It was affectionately called the "click of death".


>But don't we all love them now because they support linux?

Ah, inventing an opinion to get angry about: some things never change.


Why would there be submissions? These drives predate HN.


HN occasionally discusses issues pre-dating itself.


... and the lifetime (deathtime?) award for the Deathstar only predated HN by a few months:

May 26, 2006: https://www.pcworld.com/article/535838/worst_products_ever.h...

October 9, 2006: https://news.ycombinator.com/item?id=1

Memory would still have been reasonably green.


It does however come up in comments regularly. I know I've brought it up more than once, because I had a week long ordeal replacing all the drives in an array as they died one by one back in the day.


The Deathstars were fantastic: they almost always failed on the outer edges of the platters.

So if you only formatted them (filesystem-wise) out to capacity minus 2 GB, they were a really cheap option at the time.


That’s a very different definition of fantastic than the one I use.


Once I'd worked it out - which was after the problems were public and therefore the price had utterly cratered - they were by far the cheapest storage per GB available at the time (think "by a factor of two").

I would not have let a normal business user near one, but the developers I was supporting were most pleased about their larger than expected scratch disks for test databases and intermediate compilation artifacts.

Everything breaks. Things that at least break predictably make me happier than the alternative.


The way I always put it is: you have identical drives, with identical usage, powered for identical times, and you are still surprised when a second drive fails under the high-stress environment of rebuilding after the first drive fails.


Heh, no. We had a fleet of HPE Cloudline (CL3100) servers failing at the same time because the SSDs had exhausted their write endurance.


Perhaps they were the same model. IIRC people recommend not using the same model of hardware to provide redundancy.


I generally replace HDDs in my personal zpool a few days apart, for this reason. I also order them from different suppliers, so I can get different manufacture dates.



