A Brief History of the Current Empirical Software Engineering Science Turf War (buttondown.email/hillelwayne)
67 points by luu on Nov 27, 2019 | 10 comments



There's another review of the same study/replication/rebuttal at http://shape-of-code.coding-guidelines.com/2019/11/20/a-stud... which agrees that the rebuttal is unconvincing and that the replication study rightly showed the original studies were seriously flawed.


The day after I wrote this, the replication authors came out with their rebuttal rebuttal: http://janvitek.org/var/rebuttal-rebuttal.pdf


> The projects should be chosen by controlling their characteristics rather than relying on GitHub “stars” which capture popularity and are unrelated to software development.

> “stars” ... are ... unrelated to software development.

I wouldn't call sentiment about software _entirely_ unrelated to software development - minimally I'd assume that more popular projects, receiving more traffic due to higher engagement, would also receive more scrutiny, and thus accumulate more bug reports and bug-fix commits. That would be an interesting characteristic to attempt to select for, no?
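Purely as a sketch of what that could look like: a stratified sample over star-count buckets via the GitHub search API, rather than just grabbing the top N most-starred repos. The bucket boundaries here are made up, it assumes Node 18+ for the global fetch, and unauthenticated search calls are heavily rate-limited.

    // Hypothetical sketch: sample repos from several popularity strata
    // instead of only the most-starred ones. Bucket edges are arbitrary.
    interface Repo { full_name: string; stargazers_count: number; }

    async function sampleBucket(minStars: number, maxStars: number, n: number): Promise<Repo[]> {
      const q = encodeURIComponent(`stars:${minStars}..${maxStars} language:TypeScript`);
      const res = await fetch(`https://api.github.com/search/repositories?q=${q}&per_page=${n}`);
      const body = await res.json() as { items: Repo[] };
      return body.items;
    }

    async function main() {
      const buckets: [number, number][] = [[10, 99], [100, 999], [1000, 9999], [10000, 100000]];
      for (const [lo, hi] of buckets) {
        const repos = await sampleBucket(lo, hi, 5);
        console.log(`${lo}..${hi} stars:`, repos.map(r => r.full_name));
      }
    }

    main();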

> Part of issue 9 is left unchallenged: 34% of the remaining TypeScript commits are to type declarations. In TypeScript, some files do not contain code, they only have function signatures. These are the most popular and biggest projects in the dataset. We corrected for this by removing TypeScript.

Those are... still code? At least as much as header files in C are, and they can totally still contain bugs. Now, if we're talking about how they're often redistributable copies (like headers), especially way back at the time of the paper before @types came about, and how duplication due to that may inflate meaningful SLOC, that's interesting. It would be nice to see a good argument for correcting and controlling for it, rather than dropping TypeScript outright. Then again, TS has grown a ton in the intervening years, and the "largest" repos listed are largely abandoned and obsolete now. You'd probably get different results with fresher data, too.
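For anyone who hasn't touched TS declaration files, this is roughly the distinction (file names invented): the .d.ts carries only signatures, but a wrong signature is still a bug that every consumer inherits.

    // math-utils.d.ts - declarations only, roughly analogous to a C header.
    // Declaring `max` as optional here when the implementation requires it
    // would be a bug, even though this file contains no executable code.
    export declare function clamp(value: number, min: number, max: number): number;

    // math-utils.ts - the implementation the declarations describe.
    export function clamp(value: number, min: number, max: number): number {
      return Math.min(Math.max(value, min), max);
    }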

> The classification of languages is wrong: consider Scala. In FSE, it is lumped with Clojure, Erlang & Haskell under the “Functional Paradigm”. For this to be meaningful, there must exist some shared attribute these languages have that makes programs written in them similar. Referential transparency and higher order functions could be that. But, while Scala has higher-order functions, it is imperative. So, it is not a perfect match. Worse: Java also has higher-order functions, yet it isn’t in that group.

Many newer languages are multi-paradigm, but I'd still say that Java encourages the imperative OO style, while Scala prefers the functional composition style. I don't think "a single unifying characteristic" really helps, given the feature bleed between modern languages. I think you only get a good classification by examining, e.g., what the community style standards are.

It'd be nice if open data studies like the one under discussion here were reliably posted in a format where the analysis could be easily recalculated, readjusted, tweaked, and represented. It's the kinda thing that could be neat to play around with in a notebook.


GitHub stars are first and foremost bookmarks. Not expressions of sentiment, not votes on quality, just bookmarks. You see an interesting repo, you star it, so that you can find it later.


Some things are worth knowing and some things are easy to measure; rarely do they overlap. Number of commits mentioning bugs may not be either, but given how little hard data there is out there, I applaud any effort.

Someday I hope a researcher convinces a few companies to open up their JIRA histories and is able to really study project completion/success/user reports of bugs.


> Number of commits mentioning bugs may not be either, but given how little hard data there is out there I applaud any effort.

I disagree. The original papers were widely cited and contained such serious flaws that they were actively harmful. Assuming a commit message mentioning an “infix operator” is a bug fix is just grossly negligent.
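To make the failure mode concrete (my own illustration, not the papers' actual keyword list): a naive substring match for "fix" happily flags "infix", while even a basic word-boundary regex does not.

    const naive = /fix/i;
    const bounded = /\bfix(e[sd])?\b/i;   // fix / fixes / fixed as whole words

    const messages = [
      "Fix off-by-one error in parser",          // a real bug fix
      "Document the infix operator precedence",  // not a bug fix
    ];

    for (const msg of messages) {
      console.log(msg, "| naive:", naive.test(msg), "| bounded:", bounded.test(msg));
    }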


tl;dr: Statistical analyses of GitHub are still a dumpster fire of flawed metrics. Regex matching of commit messages substitutes for bug counts; regex matching of source files substitutes for language identification. It's garbage in, garbage out, with no defensible conclusions.


These studies are BS because they capture correlations and confuse them with causation.

Good studies need controlled laboratory conditions to remove external variables.

For example, maybe FP projects are written by coders who have more years of experience on average. So in that case it's not about the language paradigm, any facts derived from the study would actually be about the individuals who write in that language. And it's not surprising at all that more experienced individuals write fewer bugs. This is just one factor that may be ignored; there are probably hundreds that can cause significant distortions in the results.

It would be like saying that Norwegian is a highly effective language because Norway has a very high GDP per capita. This completely ignores the more significant fact that the country has vast amounts of offshore oil to capitalize on.


   studies are BS because

We should not let the perfect be the enemy of the good. The problem is that we do not currently know how to carry out studies on the efficacy of programming language paradigms under "controlled laboratory conditions". It's known to be hard. Feel free to change this and become famous! The studies discussed in the article are but first steps towards:

- a better empirical grounding of programming language & software engineering research;

- more emphasis on reproducibility in science.

I'm glad that Vitek, Berger, et al. are starting serious empirical PL/SE research, and care about reproducibility! Bravo!


A study that captures a correlation instead of directly demonstrating causation is not completely useless. It seems wise to try to draw some conclusions from the massive corpus of OSS that is available, given that it's hard to imagine any researcher ever getting the budget to run even one controlled experiment on a meaningfully sized software project.

Anyway, it's not hard to imagine an improved version of the study that knows about the years of experience of each contributor (even if that necessitates reducing the size of the dataset a bit) and controls for it. I suppose that would be the "empirical software engineering" equivalent of econometrics.
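Even something as simple as stratifying on experience would be a step up. A minimal sketch of what I mean (all fields, buckets, and data here are invented, and a real analysis would want a proper regression with project-level effects):

    interface CommitRecord {
      paradigm: "functional" | "imperative";
      authorYearsExperience: number;
      isBugFix: boolean;
    }

    // Compare bug-fix rates between paradigms *within* experience strata,
    // instead of pooling everything and letting experience confound the result.
    function bugRateByParadigmWithinStrata(records: CommitRecord[]): void {
      const strata: [number, number][] = [[0, 2], [3, 7], [8, 40]]; // arbitrary buckets
      for (const [lo, hi] of strata) {
        const inStratum = records.filter(
          r => r.authorYearsExperience >= lo && r.authorYearsExperience <= hi);
        for (const paradigm of ["functional", "imperative"] as const) {
          const group = inStratum.filter(r => r.paradigm === paradigm);
          const rate = group.length
            ? group.filter(r => r.isBugFix).length / group.length : NaN;
          console.log(`${lo}-${hi} yrs, ${paradigm}: rate=${rate.toFixed(2)} (n=${group.length})`);
        }
      }
    }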




