What follows from empirical software research? (jimmyhmiller.github.io)
54 points by jimmyhmiller on April 22, 2023 | 26 comments


One of the most important lessons I've learned in my career is that if a common problem has existed for a long time, a simple solution has probably been hiding in plain sight for a long time somewhere you haven't thought to look. I don't know anything about productivity research, but the fundamentals of defect density were figured out decades ago [1].

> (a) there’s no difference between test-first and test-after but (b) interleaving short bursts of coding and testing is more effective than working in longer iterations

I'm glad you quoted this study because it's a perfect example of a conclusion that should have been the starting point of a more interesting experiment. Interleaving code and test is known to be the most impactful factor. Test-first and test-after don't differ in any of the ways that account for the vast majority of defects. Therefore of course the difference should be small, and of course shorter iterations should be better.

[1] https://www.slideshare.net/AnnMarieNeufelder/the-top-ten-thi...


Replying to my own comment to explain how it directly relates to the asked question - what to do with empirical research?

The key takeaway is to evaluate everything in terms of sensitivity, because it gives you design insight. The study quoted above identifies one insensitive factor (order) and one sensitive factor (iteration length) of interleaving test and code. Now you can choose, modify, or design any test method as long as you do short interleaving; the order is something you don't have to worry about. If the research doesn't reveal anything about the sensitivity, don't worry about it until someone figures it out; your beliefs are probably wrong and inconsequential anyway.


See, for me this is an example of how the hyper-empirical, sciencify-everything mindset is great for getting at truth but less great at giving advice.

The default, intuitive approach is to write a whole bunch of code and then test it. When people start doing TDD, they struggle with only making enough changes to the code to pass the test. But it helps a lot that the test suite, by passing, tells them when to switch from writing code (which they prefer) back to writing tests (which they don't).

Then we discover that the main reason TDD works is that it gets you to interleave coding and testing in small batches, and it works just as well to reverse the order of the batches but keep them small.
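
To make that concrete, here is a minimal sketch of one short batch (the function and test names are hypothetical, not taken from any study). Whether the test or the code is written first, the batch is the same few minutes of work, and the passing suite is the signal to start the next one:

    # One short interleaved batch (hypothetical example).
    # Test-first or test-after, the batch stays small: one behaviour,
    # just enough code, run the tests, move on.

    def slugify(title: str) -> str:
        # Just enough code to satisfy the tests below, nothing more.
        return title.strip().lower().replace(" ", "-")

    def test_slugify_basic():
        assert slugify("Hello World") == "hello-world"

    def test_slugify_trims_whitespace():
        assert slugify("  Hello World ") == "hello-world"

    if __name__ == "__main__":
        test_slugify_basic()
        test_slugify_trims_whitespace()
        print("batch passes; next small batch: punctuation handling")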

And then, somehow, knowing why TDD works (better than what people do by default) gets translated into "TDD doesn't work" (better than something that's carefully controlled to be exactly the same, except for the part that gets people to do the rest of it). And most of the people who hear that go back to writing a whole bunch of code and then testing it.

Because the real world doesn't control all the variables, we often have to think about factors that the rigorous research is silent on, because they can help or hinder us in achieving what the research says matters.


As with any research, there are many possible audiences and it's not an article targeted at you. It's your job to already be educated in a way that's relevant to yourself.

The link I posted shows that the research has pretty comprehensively determined the relative impact of various factors, and I gave an example of how to interpret new research providing single data points within that framework. So now you should have the background that you need to read the literature with an eye for your application.


> Assume for a second that a study of deep relevance to practitioners is replicated, and its conclusions accepted by the research community. It has large sample sizes, a beautiful research plan, a statistically sound system for controlling for various other explanatory factors

As a multi-decade practitioner and manager of software engineering teams, I would certainly be interested in what the best of the empirical research has to say. I think it's important to always remain open to good new ideas wherever they may come from—strong opinions, loosely held, as they say.

That said, I don't believe that statistically significant results can be found that will overturn my own instincts and judgement on any specific project to which I am dedicated. The reason for this is threefold: 1) the universe of software and the goals we pursue with it is astronomically large, 2) competence in software engineering depends on personal aptitudes and mindsets combined with years of practice, and 3) measuring outcomes in software engineering across diverse projects is all but impossible. In other words, you can't equate tools, you can't equate projects, and most of all you can't equate people.

At the end of the day, success in software engineering comes from relentless focus on the specific goals at hand. One must be inherently curious and have a craftsperson's mentality about acquiring technical skill, but never become religious about methodology. This requires continuous first-principles thinking targeted at specifics. In the end, two expert practitioners could propose unorthodox and diametrically opposed approaches to the same problem, and both would still dramatically outperform a less skilled journeyman who attempted to follow every best practice.

Empirical studies and the scientific method in general work fantastically well for uncovering the rules and inner workings of the natural world, but software is the creation of logical systems purely by human minds, which is an entirely different challenge: there's just not enough evidence to draw on. I suspect results will be at least a couple of orders of magnitude softer than in sociology, and that probably won't sit well with the type of personality attracted to software in the first place.


> That said, I don't believe that statistically significant results can be found that will overturn my own instincts and judgement on any specific project to which I am dedicated. The reason for this is threefold: 1) the universe of software and the goals we pursue with it is astronomically large, 2) competence in software engineering depends on personal aptitudes and mindsets combined with years of practice, and 3) measuring outcomes in software engineering across diverse projects is all but impossible. In other words, you can't equate tools, you can't equate projects, and most of all you can't equate people.

This was pretty much the word-for-word argument against using statistical approaches to price insurance of shipments over the sea back in the 1700s. Yet we all know how insurance premiums are calculated today, and there's a reason for it.


Word-for-word? Really? Sorry, I just don't understand your point.

Are you saying merchants in the 1700s didn't believe insurance outcomes were quantifiable? Or are you saying that software engineering output is quantifiable? If the latter, maybe you could shed some light on how you think that would work; I'm happy to be proven wrong.


"Is success due to simple rules applied rigorously, or individual heroics?"

Yes!


Do you think methodology in general isn't useful or just empirical software research?

Methodology learns from the experience of the experts, and tries to teach the techniques they know to beginners. It's very different from statistics.


> Assume for a second that a study of deep relevance to practitioners is replicated, and its conclusions accepted by the research community. It has large sample sizes, a beautiful research plan, a statistically sound system for controlling for various other explanatory factors; whatever it is that we need to proclaim it to be a good study.

Has there ever been an empirical software study that had a beautiful research plan, sound statistical analysis, and a large sample size, and that has also been replicated? Even one?


For personal software development projects of course there are other factors that matter beyond finding the theoretically optimal X and Y. From a management perspective in business those factors might also matter. You want to get the best performance out of your team but you're not going to do that if people keep quitting due to an unpleasant work environment.

As far as advocacy goes though - when someone is recommending what other people should do - I think it's very different if there is relevant evidence and it doesn't back up the advocated position. It's even more different if there is relevant evidence that positively undermines the advocated position. There are snake oil salesmen in this industry and some of them will call you names if you don't follow their pet process. But if what they're peddling isn't backed by the evidence or even contradicts the evidence then they should be called out and their audience should probably be sceptical about anything else those same salesmen are selling as well. The old joke about someone finding it hard to believe something when their continued employment depends on its falsehood is as relevant as ever.


"Empirical software research" could mean a bunch of different things. This article is about studying people writing software, not about software used for research in empirical sciences, and not about research into computer science.

I'm confident the answer to what follows from that is "nothing yet", based on various conference talks. Studying developers (or, in the worse case, students) writing software doesn't seem to be an effective way of working out how to write software better/faster/whatever.


I think this is what is meant:

Wikipedia: https://en.wikipedia.org/wiki/Empirical_software_engineering (ESE)

Popular: https://www.americanscientist.org/article/empirical-software...

Microsoft has an ESE group with some interesting publications. I had read/downloaded a couple a while ago which unfortunately I can't now remember or locate. But this is a good starting point - https://www.microsoft.com/en-us/research/publication/belief-...


I obviously have not looked at all (or really much) of this research, but I have always felt that software development is so context dependent that drawing generalised conclusions is just an impossible ask in the first place. Even if you did, there would be so many exceptions that in practice it will come down to "use your experience to assess the context and then decide".

The example of TDD: I've done significant pieces of work both with and without TDD. In some scenarios it's a huge impediment; the actual complexity of the internal software is fairly low but the testing complexity is high (many complex stateful dependencies that are hard to control). In that case, I spent 80% of my time writing the tests and far more bugs surfaced in the tests than in the code itself.

Then in other scenarios there's high internal complexity and few external dependencies, and it's pretty much a no-brainer: TDD is not just an improvement but almost the only tractable way to write the code.
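
A hypothetical sketch of that contrast (my own example, not the projects described above), using Python's unittest.mock to stand in for the hard-to-control stateful dependencies:

    from unittest.mock import MagicMock

    # Scenario 1: trivial logic, expensive tests. Most of the effort (and
    # most of the bugs) ends up in the mock setup, not the code under test.
    def archive_order(order_id, db, queue, mailer):
        order = db.fetch(order_id)
        db.mark_archived(order_id)
        queue.publish("archived", order_id)
        mailer.notify(order["owner"], order_id)

    def test_archive_order():
        db, queue, mailer = MagicMock(), MagicMock(), MagicMock()
        db.fetch.return_value = {"owner": "a@example.com"}
        archive_order(42, db, queue, mailer)
        db.mark_archived.assert_called_once_with(42)
        queue.publish.assert_called_once_with("archived", 42)
        mailer.notify.assert_called_once_with("a@example.com", 42)

    # Scenario 2: real internal complexity, no external dependencies.
    # Writing the cases first is the natural way to pin the logic down.
    def next_billing_day(preferred_day, days_in_month):
        # Clamp the customer's preferred day to the length of the month.
        return min(preferred_day, days_in_month)

    def test_next_billing_day():
        assert next_billing_day(31, 28) == 28
        assert next_billing_day(15, 30) == 15

    if __name__ == "__main__":
        test_archive_order()
        test_next_billing_day()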

Then it's very personal as well. One person will work well with TDD and another will struggle. Dumb things, like whether your preferred development environment is conducive to rapidly running and iterating on tests, are probably going to dominate.

The end result is, I think, that these studies just can't possibly control all the variables, which is why they end up with conclusions that are either invalid or too specialised, or, as Jimmy says, the more rigorous the study, the less significant the results.


It seems to me that you misunderstand the purpose of statistical generalizations in the first place? The purpose of looking at multiple cases is to derive common patterns regardless of the differences in detail among individual cases. Which is why, yes, you still need to apply the inferences to a specific context when you need to make a decision, but to say that nothing can be gained from a comparative big-picture approach is a malicious withdrawal from critical thinking.


I don't know if TDD is effective. It seems effective on my single-person projects, less so on multi-person projects.

However, TDD is pretty effective in working with chatGPT. I always tell it to write the tests first.


Does any software engineering research take into account the human factors of context, interest, exhaustion, and aptitude?


I doubt it. Similar to the project trinity - functionality, price, time - which never included fun...


The author's overthinking it. He cares about productivity—it's just that the effect sizes are too small in these studies to overcome his prior beliefs.


In the article I’m assuming ideal conditions. So if you think a large effect size is important, throw that into the ideal conditions. I don’t think that changes anything I wrote. Maybe I’m missing something?


If you throw that in, you’d need to throw the TDD example out. A stronger test of your arguments would be, eg, research on productivity impact of generative AIs like copilot.

(This would grow your example to match your argument, whereas my initial post would shrink your argument to match your example.)


The TDD example was just an example. I wasn't taking the actual finding of the TDD research; I was talking about a hypothetical finding. You can replace the examples with AIs like copilot and stipulate a large effect size. Nothing I said would really change. You still have to look at your desires to figure out what you ought to do given that research.


We're talking past each other, here.

Your article presents a sufficient argument for dismissing the TDD example. Tossing effect size in, your argument still applies for dismissing an ideal study.

My point is not that what you said didn't suffice, it's that the philosophically heavyweight arguments weren't necessary. They were a hand-written recursive-descent parser, when the example could have been solved with a regex.


Yeah, you could very well be right that this is all overkill. But what I wanted to do was handle the general case rather than deal with things on a case-by-case basis. So to extend your analogy, I consider this blog post a parser generator. It isn't supposed to parse any particular text, but to allow you to create a parser for anything you want.

I've found that my short replies to empirical advocates seem to not connect. There seemed to be an assumption that I just wanted to hold onto my opinions and not think deeply about anything. So I wanted to do the exact opposite. Be as pedantic as I could (in the space of a shortish blogpost).


What a lot of words to justify continuing preaching TDD despite no evidence that it's better. (Guess what, it's not worse either, so if you want to personally use it, go for it. Just stop insisting other people should "convert".)

Of course if something has a huge impact on my productivity, I want to practice it. Even if it's not fun. There's a lot of denial embedded in this article.


I think you missed the point of the article. I actually don't do TDD at all; I'm personally not a fan. So it feels a little weird that you thought this was an article trying to justify TDD use.



