This tool seems like it would encourage writing exactly the kind of tests I hate: change detectors. Tests that break every time you change the code regardless of whether the change introduced a bug or not.
Change detector tests are worse than useless because they double the time it takes to make a code change, while not providing any useful information. I already know I changed the code, I don't need a test to tell me that.
We should aspire to write tests that aren't just change detectors, but catch real bugs. Perhaps a tool like this could be useful in an inverse way, to tell you that your test is sensitive to irrelevant details and should be made more general.
I understand exactly what you mean, and I used to have the same negative attitude towards "change detectors", but I have actually changed my mind about them and come to appreciate their value.
I suspect the root cause of your headache with them is that you hate to see tests break when you change your code, probably because you view fixing the test as costly and a distraction.
But I've come to see breaking "change detectors" as a source of comfort, sort of confirming what I expect my change to do.
What you need to get there is to make test "fixing" more agile and less costly. I do understand that in a large shared codebase that is easier said than done, but on smaller projects it's quite achievable.
Effective code tends to put a lot of functionality behind a relatively small interface, and tends to work for a long time without needing attention.
A "change detector" is exactly what you want in this situation. Does the code still fulfill the spec (tests)?
Tests that feel like they need to be changed every time the code changes are generally white box instead of black box, or the interface to the code being tested is too large.
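To make that concrete, here's a minimal sketch (the function and test names are invented for illustration): the white-box test pins a private helper and breaks on any refactor, while the black-box test only breaks when observable behaviour changes.

```python
from unittest import mock

# Code under test (hypothetical)
def normalize(text):
    return _strip_punctuation(text).lower()

def _strip_punctuation(text):
    return "".join(c for c in text if c.isalnum() or c.isspace())

# White-box: pinned to a private helper; any refactor of normalize() breaks it,
# even when the observable behaviour is unchanged.
def test_normalize_white_box():
    with mock.patch(__name__ + "._strip_punctuation", return_value="Hello") as spy:
        assert normalize("Hello!") == "hello"
        spy.assert_called_once_with("Hello!")

# Black-box: cares only about inputs and outputs, so it survives refactors
# and fails only when behaviour actually changes.
def test_normalize_black_box():
    assert normalize("Hello, World!") == "hello world"
```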
> Effective code tends to put a lot of functionality behind a relatively small interface
Do you have actual strategies for reducing high surface-area:volume requirements into more easily managed low surface-area:volume code, or is this just wishful thinking?
Nobody likes shim layers but they (and other horizontal strata) happen for a reason.
That's exactly what I like about good tests: they serve as a double-entry bookkeeping system. Telling me that I changed something is very useful, because they help me verify that I changed only the things that I wanted to change.
I completely agree that change detector tests are worse than useless; they often have a net negative impact.
When designing mutation testing, we spent most of our time on suppression heuristics to avoid generating bad mutants. Some bad mutants fall into the category of change-detector tests, but there are many more categories (see the Appendix for a taste).
Finally, mutants don't have to be killed, they just point out a potential weakness in the test suite, and it's up to the developers to make that call.
First, it's worth noting that the output of the mutation test is also a heuristic.
Secondly, I have a really hard time imagining a case where a change in branching logic that doesn't trigger a test failure isn't signalling a "coverage smell".
Edit: And I do totally relate to what you are saying about change traps. And, I think, if such practice is already prevalent in the codebase, mutation testing will exacerbate it.
It depends on how the tests work. If they all treat the program as a black box, there should be no problem. If the tests are unit tests that use a lot of mocks, then yes, this sort of methodology might produce crappy results.
I feel like "mutation testing" is somewhat of a misnomer, since it sounds like it's a form of testing and is e.g. complimentary to or a replacement for unit testing. Rather, it's a measure of code coverage by your test suite.
As such, it's mostly useful when you've exhausted other, easier methods of finding parts of your code you forgot to test, such as regular line, branch, etc. coverage. I think there are few projects for which that's the case, and that this, more than anything, has "hindered" the adoption of mutation testing.
Although I'm sure I've heard the term before, I forgot what it meant. Your comment helped me put it into context as a fault injection technique.
One point the paper makes in the introduction is that "coverage alone might be misleading, as in many cases where statements are covered but their consequences not asserted upon." To satisfy profile-guided coverage (e.g., gcov), the test doesn't have to be correct or useful, it just has to execute the line or take the branch.
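A tiny, hypothetical example of that failure mode: both tests below produce the same line and branch coverage, but only the second constrains the behaviour enough to kill mutants.

```python
# shipping.py (hypothetical)
def shipping_cost(total):
    if total >= 50:        # a mutant flipping '>=' to '>' changes behaviour at 50
        return 0
    return 5

# Executes every line, so coverage reports 100%, but asserts nothing;
# the '>=' -> '>' mutant (and most others) would survive.
def test_shipping_cost_runs():
    shipping_cost(50)
    shipping_cost(10)

# Same coverage numbers, but the boundary assertion kills that mutant.
def test_shipping_cost_behaviour():
    assert shipping_cost(50) == 0
    assert shipping_cost(10) == 5
```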
I am not up on all the fancy A.I. tech out there... but how different is this from something like go-fuzz, written by Dmitry Vyukov at Google, which (if my memory serves) does fuzz testing based on genetic algorithms?
Fuzz testing is different than the kind of mutation testing being referred to here. Mutation testing is about creating 'mutant' versions of your source code and determining whether your test suite detects the mutant. It is a measure similar to code coverage, in that it is a measure for determining the effectiveness of your test suite.
For example, a mutation may be changing a '==' to '!='. Your test suite is then run over the mutated source code and the mutant is said to then be 'killed' if at least one test fails. This is repeated many times, each with a different mutation to your source code. Your test suite is then given a mutation score based on the number of mutants killed divided by the total mutants.
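As a rough, toy illustration of that kill/score loop (this is not how a real mutation-testing tool works internally, just the idea):

```python
# Toy mutation-testing loop: apply one operator swap at a time and check
# whether the "test suite" notices.
ORIGINAL = """
def is_adult(age):
    return age >= 18
"""

MUTATIONS = [(">=", "<"), (">=", ">"), (">=", "==")]   # one swap per mutant

def suite_kills(source):
    """Run the tests against a mutant; return True if any test fails (mutant killed)."""
    namespace = {}
    exec(source, namespace)
    is_adult = namespace["is_adult"]
    try:
        assert is_adult(18) is True    # boundary case
        assert is_adult(17) is False
        assert is_adult(30) is True    # needed to kill the '==' mutant
    except AssertionError:
        return True                    # a test failed -> mutant killed
    return False                       # all tests passed -> mutant survived

killed = sum(suite_kills(ORIGINAL.replace(old, new, 1)) for old, new in MUTATIONS)
print(f"mutation score: {killed}/{len(MUTATIONS)}")    # 3/3 with these tests
```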
Of course, there are some mutations that actually produce functionally identical code to the original. This means it isn't always possible for your test suite to kill every mutant.
Because many different mutations are made, with each one resulting in your test suite being run, mutation testing can become expensive. Having only read the abstract, this looks to be about a way of determining which parts of the source code are not worth mutating, hence reducing the number of times your test suite needs to be run.
Source code is the input data for the automated tests. Mutation testing is like "fuzzing the tests": providing bad source code as input to see if they fail to detect it.
From the abstract of the referenced 1978 paper (Demillo et al): "One of the most interesting possibilities is that the mutation idea could form the basis for statistically inferring the likelihood of remaining errors in a program."
There's also "An Industrial Application of Mutation Testing: Lessons, Challenges, and Research Directions (2018)" by the same authors: https://ai.google/research/pubs/pub46907
It's interesting that they mention lines and line coverage so much and not statement coverage. I would think that statement coverage would be a much more effective measure of what should be mutated, if the metric is available from the instrumentation tool being used; otherwise it's often just going to be testing which covered lines contain uncovered statements. In other words, it's doing the coverage tool's job.
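For example (a hypothetical Python snippet, though the same applies to gcov-style line coverage in C), a single line can hide an unexecuted statement that line coverage still reports as covered, so mutating it only produces mutants no test can kill:

```python
def reset(cache, force=False):
    if force: cache.clear()   # one line, two statements
    return cache

def test_reset():
    # Line coverage marks the 'if' line as hit even though cache.clear()
    # never executes; a mutant deleting or altering the clear() call would
    # survive, and a line-based mutation tool can't tell the difference.
    assert reset({"a": 1}) == {"a": 1}
```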
In any case, it seems like it could be a useful tool if developers know how to use it. It seems like this is ideal for catching tests which fail to actually test statements despite covering them. Like the post below mentions, it will probably result in tests that just detect change if developers are not trained on the tool and testing strategies.