Hi, blog post author here (unrelated to paper authors).
> So in a laps of 15 years (2010-1025),
The paper was published in 2014, so the period is 2010-2014, not 2010-2025.
> they hand picked 20 bugs from 5 open source filesystem projects (198 total)
The bugs were randomly chosen; "hand picked" would imply that the authors investigated the contents of the bug reports deeply before deciding whether to include them (which would certainly fall under "bad science"). The paper states the following:
> We studied 198 randomly sampled, real world failures reported on five popular distributed data-analytic and storage systems, including HDFS, a distributed file system [27]; Hadoop MapReduce, a distributed data-analytic framework [28]; HBase and Cassandra, two NoSQL distributed databases [2, 3]; and Redis, an in-memory key-value store supporting master/slave replication [54]
So only 1 out of 5 projects is a file system.
> and extrapolated this result. That is not science.
The authors also provide a measure of statistical confidence in the 'Limitations' section.
> (3) Size of our sample set. Modern statistics suggests that a random sample set of size 30 or more is large enough to represent the entire population [57]. More rigorously, under standard assumptions, the Central Limit Theorem predicts a 6.9% margin of error at the 95% confidence level for our 198 random samples. Obviously, one can study more samples to further reduce the margin of error
Do you believe that this is insufficient or that the reasoning in this section is wrong?
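For what it's worth, the 6.9% figure is easy to sanity-check. Here is a quick sketch, assuming the paper is using the standard worst-case binomial approximation z * sqrt(p*(1-p)/n) for a sampled proportion (the paper does not show the computation itself):

    import math

    # Worst-case margin of error for a sample proportion at 95% confidence.
    # Assumptions (not stated in the paper): p = 0.5 maximizes p*(1-p),
    # z = 1.96 is the two-sided 95% normal quantile, n = 198 sampled failures.
    n = 198
    p = 0.5
    z = 1.96
    margin = z * math.sqrt(p * (1 - p) / n)
    print(f"margin of error = {margin:.1%}")  # about 7.0%, in line with the paper's 6.9%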
> The paper was published in 2014, so the period is 2010-2014, not 2010-2025.
Oh indeed, my bad.
> The bugs were randomly chosen; "hand picked" would imply that the authors investigated the contents of the bug reports deeply before deciding whether to include them
No, no, they were not randomly chosen; they even have a whole paragraph explaining how they randomly picked _from a pool of manually selected bugs_. Their selection criteria ranged from "being serious" to "having a lot of comments", "they can understand the patch", or "the patch is not from the reporter".
> The authors also provide a measure of statistical confidence in the 'Limitations section'.
This is a measure of confidence that their random sample is representative of the hand-picked bugs...
> under standard assumptions, the Central Limit Theorem predicts a 6.9% margin of error at the 95% confidence level
I would love to see them prove the normality assumption of the bug root-cause distribution.
Also, the whole categorization they do seems purely qualitative.
I have been guilty of this kind of comment myself and have come to realize they do nothing to further my point. I would suggest reading the guidelines, because they are very well written on this point: https://news.ycombinator.com/newsguidelines.html
> We studied 198 randomly sampled, real world failures reported on five popular distributed data-analytic and storage systems, including HDFS, a distributed file system [27]; Hadoop MapReduce, a distributed data-analytic framework [28]; HBase and Cassandra, two NoSQL distributed databases [2, 3]; and Redis, an in-memory key-value store supporting master/slave replication [54]
So, analyzing databases, we picked Java, Java, Java, Java, and one in C. This does not seem very random. I suppose this may provide insight into failure modes in Java codebases in particular, but I'm not sure I'd be in a hurry to generalize.