
> If you have a condition that takes, on average, 1000 hours to occur, you have a 9 in 10 chance of missing it based on 100 error-free hours observed

Yes. Which means 9/10 users who used their machines 100 or fewer hours on the new kernel will never hit the hypothetical bug. Thank you for proving my point!

I'm not a filesystem developer; I'm a user. As a user, I don't care about the long tail, only the average case as it relates to my deployment size. As you correctly point out, my deployment is of negligible size, and the long tail is far, far beyond my reach.

Aside: your hypothetical event has a 0.1% chance of happening each hour, which means it has a 99.9% chance of not happening each hour. The odds it doesn't happen in 100 hours are 0.999^100, or about 90.5%. I think you know that; I just don't want a casual reader to infer it's exactly 90% because 1 - (100/1000) is 0.9.
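
Checking that arithmetic in plain Python, using the same hypothetical numbers (nothing here is specific to the actual bug):

    p_hourly = 0.001                 # hypothetical: 0.1% chance per hour
    p_miss = (1 - p_hourly) ** 100   # chance of 100 consecutive error-free hours
    print(p_miss)                    # ~0.905, i.e. 90.5%
    print(1 - 100 / 1000)            # the naive 0.9: close, but a different quantity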



> Which means 9/10 users who used their machines 100 or fewer hours on the new kernel will never hit the bug.

No, that's not how probabilities work at all for a bug that strikes with a constant per-hour probability (i.e. not a bug that deterministically happens n hours after boot). If you have millions of users, some of them will hit it within hours or even minutes after boot!
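
To make that concrete with the same hypothetical 0.1%/hour rate from upthread (an illustration, not the real rate of this bug):

    users = 1_000_000                       # hypothetical install base
    p_hourly = 0.001                        # hypothetical failure rate per hour
    for hours in (1, 5, 24):
        p_hit = 1 - (1 - p_hourly) ** hours
        print(hours, round(users * p_hit))  # roughly 1000, 5000, and 24000 users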

> As you correctly point out, my deployment is of negligible size, and the long tail is far far beyond my reach.

So you don't expect to accrue on the order of 1000 machine-hours in your deployment? That's only about six weeks for a single always-on machine, or about half a week for 10. That would be way too much for me even for my home server RPi, let alone anything that holds customer data.

> I'm not a filesystem developer, I'm a user: I don't care about the long tail, I only care about the average case as it relates to my deployment size.

Yes, but unfortunately you seem to either have the math completely wrong or I'm not understanding your deployment properly.


> So you don't expect to accrue on the order of 1000 machine-hours in your deployment?

The 1000 number came from you. I have no idea where you got it from. I suspect the "real number" is several orders of magnitude higher, but I have no idea, and it's sort of artificial in the first place.

My overarching point is that mine is such a vanishingly small portion of the universe of machines running btrfs that I am virtually guaranteed that bugs will be found and fixed before they affect me, exactly as happened here. Unless you run a rather large business, that's probably true for you too.

The filesystem with the most users has the fewest bugs. Nothing with the feature set of btrfs has even 1% of the real-world deployment footprint it does.

> If you have millions of users, some of them will hit it within hours or even minutes after boot!

This is weirdly sensationalist: I don't get it. Nobody dies when their filesystem gets corrupted. Nobody even loses money, unless they've been negligent. At worst it's a nuisance to restore a backup.


> The 1000 number came from you. I have no idea where you got it from.

It's an arbitrary example of an error rate you'd have a 90% chance of missing in your sample size of 100 machine-hours, yet much too high for almost any meaningful application.

I have no idea what the actual error rate of that btrfs bug is; my only point is that your original assertion of "I've experienced 100 error-free hours, so this is a non-issue for me and my users" is a non sequitur.

> This is weirdly sensationalist: I don't get it. Nobody dies when their filesystem gets corrupted. Nobody even loses money, unless they've been negligent.

I don't know what to say to that other than that I wish I had your optimism on reliable system design practices across various industries.

Maybe there's a parallel universe where people treat every file system as having an error rate of something like "data corruption/loss once every four days", but it's not the one I'm familiar with.

For better or worse, the bar for file system reliability is much, much, much, much higher than anything you could reasonably produce empirical data for unless you're operating at Google/AWS etc. scale.


> "I've experienced 100 error-free hours, so this is a non-issue for me and my users"

It's a statement of fact: it has been a non-issue for me. If you're like me, it's statistically reasonable to assume it will be a non-issue for you too. Also, no users, just me. "Probably okay" is more than good enough for me, and I'm sure many people have similar requirements (clearly not you).

I have no optimism, just no empathy for the negligent: I learned my lesson with backups a long time ago. Some people blame the filesystem instead of their backup practices when their data is corrupted, but I think that's naive. The filesystem did you a favor, fix your shit. Next time it will be your NAS power supply frying your storage.

It's also a double-edged sword: the more reliable a filesystem is, the longer users can get away without backups before being bitten, and the greater their ultimate loss will be.


> It's a statement of fact: it has been a non-issue for me.

Yes...

> If you're like me, it's statistically reasonable to assume it will be a non-issue for you too.

No! This simply does not follow from the first statement, statistically or otherwise.

You and I might or might not be fine; you having been fine for 100 hours on the same configuration just offers next-to-zero predictive power for that.


> No! This simply does not follow from the first statement, statistically or otherwise.

> You and I might or might not be fine; you having been fine for 100 hours on the same configuration just offers next-to-zero predictive power for that.

You're missing the forest for the trees here.

It is predictive ON AVERAGE. I don't care about the worst case like you do: I only care about the expected case. If I died when my filesystem got corrupted... I would hope it's obvious I wouldn't approach it this way.

Adding to this: my laptop has this btrfs bug right now. I'm not going to do anything about it, because it's not worth 20 minutes of my time to rebuild my kernel for a bug that is unlikely to bite before I get the fix in 6.9-rc1, and would only cost me 30 minutes of time in the worst case if it did.

I'll update if it bites me. I've bet on much worse poker hands :)


Well, from your data (100 error-free hours, sample size 1) alone, we can only conclude this: “The bug probably happens less frequently than every few hours”.
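
For what it's worth, the standard back-of-the-envelope here is the "rule of three": zero failures observed in n independent trials puts an approximate 95% upper confidence bound of 3/n on the per-trial failure probability. Sketching it with your 100 hours treated as 100 hour-long trials:

    n_hours = 100            # error-free hours observed
    p_upper = 3 / n_hours    # ~95% upper confidence bound per hour
    print(p_upper)           # 0.03: can't rule out a 3%-per-hour failure rate
    print(1 / p_upper)       # ~33: i.e. a failure as often as every ~33 hours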

Is that reliable enough for you? Great! Is that “very rare”? Absolutely not for almost any type of user/scenario I can imagine.

If you’re making any statistical arguments beyond that data, or are implying more data than that, please provide either, otherwise this will lead nowhere.


> I only care about the expected case.

The expected case after surviving a hundred hours is that you're likely to survive another hundred.

Which is a completely useless promise.

That piece of data doesn't let you predict anything at reasonable time scales for an OS install.

You can't squeeze more implications out of such a small sample.
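
Put differently: if failures arrive at a roughly constant rate, the waiting time is approximately exponential, and the exponential distribution is memoryless, so having survived 100 hours doesn't improve the odds for the next 100. A sketch with the hypothetical 1-in-1000-hours rate:

    import math

    def surv(t, rate=1/1000):         # hypothetical failures per hour
        return math.exp(-rate * t)    # P(no failure within t hours)

    print(surv(200) / surv(100))      # P(100 more hours | survived 100 already)
    print(surv(100))                  # identical: ~0.905 either way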


I don't care about the aggregate: I only care about me and my machine here.

> The expected case after surviving a hundred hours is that you're likely to survive another hundred.

That's exactly right. I don't expect to accrue another hundred hours before the new release, so I'll likely be fine.

> Which is a completely useless promise.

Statistics is never a promise: that's a really naive concept.

> at reasonable time scales for an OS

The timescale of the OS install is irrelevant: all that matters is the time between when the bug is introduced and when it is fixed. In this case, about nine months.


You only use your machines for twenty hours per month?

Even so, "likely" here is something like "better than 50:50". Your claim was "very very rare" and that's not supported by the evidence.

> Statistics is never a promise: that's a really naive concept.

It's a promise of odds with error bars, don't be so nitpicky.


> Even so, "likely" here is something like "better than 50:50". Your claim was "very very rare" and that's not supported by the evidence.

You're free to disagree, obviously, but I think it's accurate to describe a race condition that doesn't happen in 100 hours on multiple machines with clock rates north of 3GHz as "very very rare". That particular code containing the bug has probably executed tens of millions of times on my little pile of machines alone.
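
Taking that tens-of-millions figure at face value (an assumption, and the rule of three also treats executions as independent trials, which a race condition may well violate), the same bound from upthread does give a tiny per-execution rate:

    n_execs = 10_000_000     # assumed executions of the racy code path
    p_upper = 3 / n_execs    # ~95% upper bound per execution: 3e-07
    print(p_upper)           # rare per execution, but the wall-clock rate
                             # still depends on how often the path runs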

> It's a promise of odds with error bars, don't be so nitpicky.

No, it's not. I'm not being nitpicky, the word "promise" is entirely inapplicable to statistics.


If my computer has a filesystem error that happens every week of uptime (168 machine hours), I call that "common".


> Nothing with the feature set of btrfs has even 1% the real world deployment footprint it does.

So you haven't heard of zfs then?


A single 9 in reliability over 100 hours would be colossally bad for a filesystem. For the average office user, 100 hours is not even a month's worth of daily use.
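
To put a number on "colossally bad": assuming a constant failure rate, a single 9 over 100 hours extrapolates over a work year of roughly 2000 hours to:

    p_100h = 0.90                    # "a single 9": no failure in 100 hours
    p_year = p_100h ** (2000 / 100)  # constant-rate extrapolation: 0.9 ** 20
    print(p_year)                    # ~0.12, i.e. an ~88% chance of at least
                                     # one failure within the year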

Even as an anecdote this is completely useless. A couple thousand hours and dozens of mount/unmount cycles would just be a good start.


> Yes. Which means 9/10 users who used their machines 100 or fewer hours on the new kernel will never hit the hypothetical bug. Thank you for proving my point!

So... that's really bad.



