Hi, blog post author here (unrelated to paper authors).
> So in a laps of 15 years (2010-1025),
The paper was published in 2014, so the period is 2010-2014, not 2010-2025.
> they hand picked 20 bugs from 5 open source filesystem projects (198 total)
The bugs were randomly chosen; "hand picked" would imply that the authors investigated the contents of the bug reports deeply before deciding whether to include them (which would certainly fall under "bad science"). The paper states the following:
> We studied 198 randomly sampled, real world failures reported on five popular distributed data-analytic and storage systems, including HDFS, a distributed file system [27]; Hadoop MapReduce, a distributed data-analytic framework [28]; HBase and Cassandra, two NoSQL distributed databases [2, 3]; and Redis, an in-memory key-value store supporting master/slave replication [54]
So only 1 out of 5 projects is a file system.
> and extrapolated this result. That is not science.
The authors also provide a measure of statistical confidence in the 'Limitations' section.
> (3) Size of our sample set. Modern statistics suggests that a random sample set of size 30 or more is large enough to represent the entire population [57]. More rigorously, under standard assumptions, the Central Limit Theorem predicts a 6.9% margin of error at the 95% confidence level for our 198 random samples. Obviously, one can study more samples to further reduce the margin of error
Do you believe that this is insufficient or that the reasoning in this section is wrong?
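For what it's worth, the 6.9% figure is easy to sanity-check. Here is a quick sketch, assuming the paper is using the standard worst-case binomial approximation z * sqrt(p*(1-p)/n) for a sampled proportion (the paper does not show the computation itself):

    import math

    # Worst-case margin of error for a sample proportion at 95% confidence.
    # Assumptions (not stated in the paper): p = 0.5 maximizes p*(1-p),
    # z = 1.96 is the two-sided 95% normal quantile, n = 198 sampled failures.
    n = 198
    p = 0.5
    z = 1.96
    margin = z * math.sqrt(p * (1 - p) / n)
    print(f"margin of error = {margin:.1%}")  # about 7.0%, in line with the paper's 6.9%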
> The paper was published in 2014, so the period is 2010-2014, not 2010-2025.
Oh indeed, my bad.
> The bugs were randomly chosen; "hand picked" would imply that the authors investigated the contents of the bug reports deeply before deciding whether to include them
No, no, they were not randomly chosen; they even have a whole paragraph explaining how they randomly picked _from a pool of manually selected bugs_. Their selection criteria ranged from "being serious" to "having a lot of comments", "they can understand the patch", or "the patch is not from the reporter".
> The authors also provide a measure of statistical confidence in the 'Limitations section'.
This is a measure of confidence that their random sample is representative of the hand-picked bugs...
> under standard assumptions, the Central Limit Theorem predicts a 6.9% margin of error at the 95% confidence level
I would love to see them prove the normality assumption of the bug root-cause distribution.
Also, the whole categorization they do seems purely qualitative.
I have been guilty of this kind of comment myself and have come to realize they do nothing to further my point. I would suggest reading the guidelines, because they are very well written on this point: https://news.ycombinator.com/newsguidelines.html
> We studied 198 randomly sampled, real world failures reported on five popular distributed data-analytic and storage systems, including HDFS, a distributed file system [27]; Hadoop MapReduce, a distributed data-analytic framework [28]; HBase and Cassandra, two NoSQL distributed databases [2, 3]; and Redis, an in-memory key-value store supporting master/slave replication [54]
So, analyzing databases, we picked Java, Java, Java, Java, and one in C. This does not seem very random. I suppose this may provide insight into failure modes in Java codebases in particular, but I'm not sure I'd be in a hurry to generalize.