Ever since that Backblaze post, I've been monitoring only these five relevant SMART values, and I replace a disk once two of them are outside the normal range. I haven't had a disk fail on me since.
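For anyone wanting to try the same rule, here is a minimal sketch. It assumes the five values are the ones Backblaze singled out (SMART 5, 187, 188, 197 and 198) and that "not normal" simply means a raw value above zero. The function names and the input format (attribute ID mapped to raw value) are made up for illustration.

```python
# Assumption: these are the five attributes from the Backblaze post.
BACKBLAZE_ATTRS = {
    5: "Reallocated Sectors Count",
    187: "Reported Uncorrectable Errors",
    188: "Command Timeout",
    197: "Current Pending Sector Count",
    198: "Uncorrectable Sector Count",
}


def abnormal_attributes(raw_values):
    """Return the watched attribute IDs whose raw value is above zero.

    raw_values maps SMART attribute ID -> raw value; missing IDs count as 0.
    Treating any non-zero raw value as "not normal" is an assumption.
    """
    return sorted(a for a in BACKBLAZE_ATTRS if raw_values.get(a, 0) > 0)


def should_replace(raw_values, threshold=2):
    """Apply the rule: replace once `threshold` watched values are abnormal."""
    return len(abnormal_attributes(raw_values)) >= threshold


if __name__ == "__main__":
    disk = {5: 8, 187: 0, 188: 0, 197: 2, 198: 0, 9: 41234}
    print(abnormal_attributes(disk), should_replace(disk))
```

Feeding it real numbers is left out on purpose; how you read the raw values depends on your platform and tooling.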
Before that, I had lost all trust in SMART, as it would report drives as "healthy" that could fail the next day.
While the project is cool-looking, its totally usele^W^W usefulness is quite questionable.
In my life I've seen totally dead disks with ideal SMART, and disks with SMART indicating that I should have thrown them out already, yet they worked fine for literally a decade after that. I've configured monitoring and reporting for SMART attribute changes, and I've used all available SMART tools on the platforms to diagnose HDDs that were acting up. I've replaced drives in computers, servers, blades and SANs. I've ordered and/or supervised the replacement of drives by padawans and vendor engineers.
You know what I never did?
I never had or wanted a dashboard with the SMART statuses of all the disks I'm responsible for. Because if the drives are healthy, I have absolutely zero reason to watch that dashboard. And if some drive is acting up, I just need a notification about that (and which disk and where it is) to order the replacement.
Sure, I'm in a different boat than the author of that project, and it can be a cool nerd-party trick to show the SMART status of all 48 HDDs in your r/datahoard box... For all other situations (especially if you are on the operations side of things), look for more classical (and probably more mature?) solutions.
There are two kinds of SMART errors. Soft errors, like read retries, occur in normal operation and tell you more about the environment the drive is operating in than about the health of the drive. Hard errors, like reallocated sectors, never occur in normal operation, so the first time the drive logs one, the drive is failing and you need to replace it. Monitoring for the first kind of error is pointless, but closely monitoring for the second kind will improve your chances of replacing a drive before it loses your data. It's just unfortunate that the error thresholds built into drives won't throw a SMART failure warning before the drive is completely dead.
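To make the "alert on the first hard error" idea concrete, here is a rough sketch that compares two snapshots of raw values and reports only hard-error counters that went up. Reallocated sectors (SMART 5) is the example named above; including 197 and 198 in the hard-error set is my assumption, as are the function name and the snapshot format.

```python
# Assumption: which attributes count as "hard" errors. Only reallocated
# sectors (5) is named in the comment above; 197 and 198 are added here
# purely for illustration.
HARD_ERROR_ATTRS = {
    5: "Reallocated sectors",
    197: "Current pending sectors",
    198: "Offline uncorrectable sectors",
}


def new_hard_errors(previous, current):
    """Return {attr_id: (old, new)} for hard-error counters that increased.

    Both arguments map SMART attribute ID -> raw value. Anything outside
    HARD_ERROR_ATTRS (the "soft" errors) is ignored on purpose.
    """
    changes = {}
    for attr in HARD_ERROR_ATTRS:
        old = previous.get(attr, 0)
        new = current.get(attr, 0)
        if new > old:
            changes[attr] = (old, new)
    return changes


if __name__ == "__main__":
    yesterday = {5: 0, 197: 0, 199: 3}
    today = {5: 1, 197: 0, 199: 10}
    for attr, (old, new) in new_hard_errors(yesterday, today).items():
        print(f"{HARD_ERROR_ATTRS[attr]}: {old} -> {new}, order a replacement")
```

The point of diffing snapshots rather than relying on the drive's own thresholds is that the very first increment triggers the alert.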
It's a very different story for SSDs. Those, you definitely do want to monitor in case your workload is burning through the rated write endurance faster than planned for. And reallocated sectors are usually not an urgent problem: a handful early in the life of the drive can be a normal consequence of vendors not aggressively testing (excessively wearing out) a drive before it leaves the factory, and a steadily increasing number as a drive approaches end of life is expected behavior.
But SMART errors usually won't help you know when you're about to lose an SSD to a catastrophic firmware bug, which for many use cases is the more likely cause of death.
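On the write-endurance point, the check is basic arithmetic: extrapolate the write rate so far and see how long the remaining rated endurance lasts. A small sketch, assuming you already have terabytes written, the vendor's rated TBW and power-on hours from somewhere; the function name and units are illustrative, and it assumes the future write rate matches the past average.

```python
HOURS_PER_YEAR = 24 * 365


def projected_endurance_years(tb_written, rated_tbw, power_on_hours):
    """Estimate years until the rated write endurance is used up.

    Assumes the workload keeps writing at its lifetime-average rate.
    Returns None when there is not enough history to extrapolate.
    """
    if power_on_hours <= 0 or tb_written <= 0:
        return None
    tb_per_year = tb_written / power_on_hours * HOURS_PER_YEAR
    remaining_tb = max(rated_tbw - tb_written, 0)
    return remaining_tb / tb_per_year


if __name__ == "__main__":
    # Hypothetical drive: 100 TB written in one year against a 600 TBW rating.
    years_left = projected_endurance_years(100, 600, HOURS_PER_YEAR)
    planned_years_left = 4  # hypothetical remaining service life
    if years_left is not None and years_left < planned_years_left:
        print("burning through endurance faster than planned")
    else:
        print("endurance on track")
```

This says nothing about firmware-related deaths, of course; it only covers the wear-out case.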
While this is already very useful for my homelab setup, I’d like to see if this could be deployed on multiple servers and communicate the status to a master node to watch over all my servers.
If this has something I could tap into to get the status info externally, I could privately hack something together to do that.
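The kind of thing I have in mind, as a very rough sketch: each server produces a small JSON report and the master node only looks at the ones with failing drives. None of this is the project's actual API; the payload fields and function names are hypothetical, and transport between nodes is left out entirely.

```python
import json


def build_status_report(hostname, drives):
    """Node side: summarize local drives as a JSON string.

    drives maps device name -> {"healthy": bool}. The payload layout is
    made up for this sketch.
    """
    failing = sorted(dev for dev, info in drives.items() if not info["healthy"])
    report = {"host": hostname, "drive_count": len(drives), "failing": failing}
    return json.dumps(report, sort_keys=True)


def hosts_needing_attention(reports):
    """Master side: return {host: [failing devices]} from JSON reports."""
    result = {}
    for raw in reports:
        report = json.loads(raw)
        if report["failing"]:
            result[report["host"]] = report["failing"]
    return result


if __name__ == "__main__":
    r1 = build_status_report("nas1", {"/dev/sda": {"healthy": True},
                                      "/dev/sdb": {"healthy": False}})
    r2 = build_status_report("nas2", {"/dev/sda": {"healthy": True}})
    print(hosts_needing_attention([r1, r2]))
```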
This looks amazing! Do you have a sample Kubernetes deployment or Helm chart we could use? I'd love to deploy the metrics collector on my k3s and have all my servers stream in data.
It has plenty of both input and output plugins, and it's tiny. Writing new plugins for it is also a breeze: you just need an app that returns a line of text.
This is kind of an unfortunate name. Broadcom has a utility for getting diagnostic data from their latest gen SAS expanders called scrutiny. I expect once that becomes more popular you're going to see some confusion around which "scrutiny" project involving hard drives you're talking about.
I've had a lot of hard drives, and seen a lot of them fail, and not once has SMART-monitoring software given me the slightest amount of warning beforehand.
I've been through plenty of drives (not sure I'd call it "a lot", given how many years this spans and how many drives others have, even in home environments), and some have given SMART warnings before otherwise showing issues (or, in most cases, were replaced before other issues appeared because of such warnings).
A couple have had failures (both full "death" and less systemic errors) without anything being reported by SMART beforehand, but that is the nature of the beast. I suspect you have just been less lucky.
Relevant because too many different metrics can be hard to make sense of and derive value from. Backblaze have a lot of hard drives. Like a lot a lot. https://www.backblaze.com/blog/how-backblaze-buys-hard-drive... (2019)