In a previous blog entry I mentioned that I like emma for measuring code coverage in my JUnit test suites. In this entry I will discuss several things that make some JUnit test suites stronger than others. Everything here refers to JUnit 3.8.1.

How do you measure the quality of your JUnit tests and put a number
on it? This is a very good question with no quick and easy answers,
I’m afraid. I can tell you two answers that are NOT measures of test quality:

  • “We have 100% success rate in our test suites every night.”
  • “We have 100% code coverage.”


Executing Code is not Testing Code

One problem is that code coverage is really only the tip of the iceberg. Obviously you’re not testing code you don’t run, so code coverage is great for a “fail-fast” or a “trivial-exclude” approach to determining the quality of your tests. But just because you’re executing code does not mean you’re testing it at all! Consider the following JUnit test:

public void testNotMuchReally() {
    ClassUnderTest instance = new ClassUnderTest();
    instance.runLotsOfStuff();
}

This is about as bad a test as you can get, short of not having it at all. It will get great numbers in the code coverage statistics, and will never get a failure (it will only succeed or get an error). The only thing it does is run the code and hope for successful completion. There isn’t even a single assert in this test. Presumably there is some effect of executing the runLotsOfStuff() method — otherwise, why does it exist? The effects of this method need to be checked.

For our example, let’s assume that the side-effect of the runLotsOfStuff() method is to populate a List property called items in the ClassUnderTest class. A much stronger test would look like this:

public void testBetter() {
    ClassUnderTest instance = new ClassUnderTest();
    assertTrue(instance.getItems().isEmpty());
    instance.runLotsOfStuff();
    assertFalse(instance.getItems().isEmpty());
}

Note that the new test asserts both pre- and post-conditions. That makes it a much stronger test, but still far from perfect: it only checks that items ends up non-empty, not what it actually contains.

Indirect vs. Direct testing

What if runLotsOfStuff() calls other methods that call other methods, and one of them, loadItems(), is the one that actually manipulates the items property? In that case we are testing loadItems() indirectly by calling runLotsOfStuff(). If we continue to make the mistake of equating code coverage with testing, we still have the problem of false positives: we will have coverage for methods that are not being directly tested. Of the entire hierarchy of methods called by runLotsOfStuff(), only loadItems() has its side effects checked, so our coverage overstates what we’re actually testing.

Sometimes, you don’t have much of a choice (like testing an abstract class, which you can’t instantiate on its own), but whenever possible, you want to test as directly as possible. I say this because you are able to have greater certainty that the success or failure of your test was because of the class under test and not because of all the layers between your test and the class under test. Consider this ridiculous and contrived example for making my point:

public void testClassUnderTestIndirectlyFalsePositive(){
  IndirectClass i = new IndirectClass();
  assertEquals(3,i.countItems());
}
...
class IndirectClass {
  ClassUnderTest cut = new ClassUnderTest();
  public int countItems() {
    return 3;   // should actually return cut.size();
  }
}
...
class ClassUnderTest {
  List items = new ArrayList();   // needs java.util.List and java.util.ArrayList
  public int size() {
    return items.size();
  }
}

The method countItems() should return cut.size() but returns a hard-coded 3. The test case, which tests ClassUnderTest only indirectly, will report success no matter how many items are in the actual items collection. If our intent is to test ClassUnderTest, that is a false positive. (Note, however, that in this silly example the test case does test IndirectClass directly and rather effectively.)
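To make the false positive concrete, here is a minimal standalone sketch. It is written as plain Java rather than as a JUnit TestCase so it runs on its own. The add() and size() helpers on ClassUnderTest and the three check methods are my own assumptions for illustration, not part of the example above.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins mirroring the contrived example; add() and size() are assumed helpers.
class ClassUnderTest {
    private final List<String> items = new ArrayList<String>();

    void add(String item) {
        items.add(item);
    }

    int size() {
        return items.size();
    }
}

class IndirectClass {
    final ClassUnderTest cut = new ClassUnderTest();

    int countItems() {
        return 3; // the bug: should be cut.size()
    }
}

class DirectVsIndirectSketch {
    // Indirect check: only looks at what IndirectClass reports.
    static boolean indirectCheckPasses() {
        IndirectClass i = new IndirectClass();
        return i.countItems() == 3;
    }

    // Direct check: exercises ClassUnderTest itself, with no layers in between.
    static boolean directCheckPasses() {
        ClassUnderTest cut = new ClassUnderTest();
        if (cut.size() != 0) {
            return false;
        }
        cut.add("a");
        cut.add("b");
        return cut.size() == 2;
    }

    // Compares the wrapper's answer with the wrapped object's real state.
    static boolean wrapperAgreesWithCut() {
        IndirectClass i = new IndirectClass();
        return i.countItems() == i.cut.size();
    }

    public static void main(String[] args) {
        System.out.println("indirect check passes: " + indirectCheckPasses());
        System.out.println("direct check passes:   " + directCheckPasses());
        System.out.println("wrapper agrees:        " + wrapperAgreesWithCut());
    }
}
```

The indirect check passes even though the wrapped ClassUnderTest is empty, while the comparison against the real size exposes the hard-coded value. Only the direct check says anything trustworthy about ClassUnderTest itself.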

Mutation Testing

There is a product out there called Jester that performs what’s known as mutation testing or mutation analysis. The idea is an ingenious, somewhat obscure, rather aggressive approach to testing: it effectively tests your tests. It works like this. First, you run your tests normally. Assuming they all pass, a tool then goes through your production code (not the test suite, the regular code being tested) and randomly introduces changes, such as negating if conditions or changing a test from if (x > y) to if (false), and reruns the entire test suite after each change. The theory goes that if your test suite is strong, the modified code should start failing your tests. If the modified code still passes your test suite, your test suite is not strong enough.
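Here is a hand-rolled sketch of the idea, not how Jester itself is implemented. The mutant is written by hand, where a real tool would generate it for you, and the MaxFunction interface and both "suites" are hypothetical names of my own. It uses the if (x > y) to if (false) mutation mentioned above.

```java
// Hypothetical interface so the original and the mutant can be swapped in.
interface MaxFunction {
    int max(int x, int y);
}

class MutationSketch {
    // Original code under test.
    static final MaxFunction ORIGINAL = new MaxFunction() {
        public int max(int x, int y) {
            if (x > y) {
                return x;
            }
            return y;
        }
    };

    // Mutant: the condition has been replaced with false, so it always returns y.
    static final MaxFunction MUTANT = new MaxFunction() {
        public int max(int x, int y) {
            if (false) {
                return x;
            }
            return y;
        }
    };

    // Weak suite: runs both branches (full coverage) but checks only one result.
    static boolean weakSuitePasses(MaxFunction f) {
        f.max(5, 3); // executed, never checked
        return f.max(1, 2) == 2;
    }

    // Stronger suite: checks the result of both branches.
    static boolean strongSuitePasses(MaxFunction f) {
        return f.max(5, 3) == 5 && f.max(1, 2) == 2;
    }

    public static void main(String[] args) {
        System.out.println("weak suite, original:   " + weakSuitePasses(ORIGINAL));
        System.out.println("weak suite, mutant:     " + weakSuitePasses(MUTANT));
        System.out.println("strong suite, original: " + strongSuitePasses(ORIGINAL));
        System.out.println("strong suite, mutant:   " + strongSuitePasses(MUTANT));
    }
}
```

The weak suite passes against both the original and the mutant, so the mutant "survives" and the suite is shown to be too weak, despite covering every line. The stronger suite passes on the original and fails on the mutant, which is what you want.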

The problem with mutation testing is that computationally it is very expensive. You need to re-run the entire suite of tests for each random change introduced, and it generally takes many individual mutations to effectively mutation-test classes of real size. We’re talking orders of magnitude more time to run your tests; the larger your system, the more mutations are required (since there are more branch points into which a mutation can be inserted). For large systems it does not take long before the code changes faster than a mutation test cycle can be completed.

Measuring Test Quality

So, back to my original question: how do you measure the quality of your JUnit tests and put a number on it? Even JUnit itself won’t tell you how many assertions were made during the course of a test run. In my view that is a major flaw of the framework (maybe someday I’ll tweak the source and add that feature for use on my own projects). If we had a number like this, we could have measures like assertion density (asserts per test, asserts per suite, etc.) and assertion success rate (as opposed to test success rate). Even a rough measure like this is better than no measure, just like having code coverage numbers is better than having no idea what’s tested at all.
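As a rough sketch of what such numbers could look like, assume a small counting helper that stands in for the assert methods. CountingAssert is hypothetical and not part of JUnit. Unlike JUnit's asserts, it records a failure instead of throwing, so that a success rate can be computed afterwards.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JUnit: counts assertions instead of throwing.
class CountingAssert {
    private int made = 0;
    private int failed = 0;

    void assertTrue(boolean condition) {
        made++;
        if (!condition) {
            failed++;
        }
    }

    void assertFalse(boolean condition) {
        assertTrue(!condition);
    }

    int made() {
        return made;
    }

    int failed() {
        return failed;
    }

    // Assertion density: assertions per test (0.0 if no tests ran).
    double density(int testCount) {
        return testCount == 0 ? 0.0 : (double) made / testCount;
    }

    // Assertion success rate: fraction of assertions that held (0.0 if none were made).
    double successRate() {
        return made == 0 ? 0.0 : (double) (made - failed) / made;
    }
}

class AssertionDensitySketch {
    static final int TEST_COUNT = 2;

    static CountingAssert runSuite() {
        CountingAssert check = new CountingAssert();

        // "Test" 1: runs code but asserts nothing, like testNotMuchReally().
        List<String> first = new ArrayList<String>();
        first.add("x");

        // "Test" 2: checks a pre- and a post-condition, like testBetter().
        List<String> second = new ArrayList<String>();
        check.assertTrue(second.isEmpty());
        second.add("x");
        check.assertFalse(second.isEmpty());

        return check;
    }

    public static void main(String[] args) {
        CountingAssert check = runSuite();
        System.out.println("assertions made:   " + check.made());
        System.out.println("assertion density: " + check.density(TEST_COUNT));
        System.out.println("success rate:      " + check.successRate());
    }
}
```

Two tests making two assertions between them gives a density of one assert per test, which hides the fact that one of them asserts nothing at all. A per-test count would be more telling than a suite-wide average.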

I think another partial answer to quantifying test quality may lie in Aspect Oriented Programming. The approach I would take using AOP would be to intercept entry into every method in the system under test and examine the call stack to see how deep it is until the caller is some subclass of TestCase, or do something functionally equivalent. This would give you an idea of how directly something is being tested: whether the code being executed was called directly by the test case, or more as a side effect. This could then be used as a weighting factor against the number of times the method is called, to produce some score for how “invoked” the method was in tests. But this is not a trivial matter, and even then, knowing how intentionally and directly a method was invoked doesn’t tell you how aggressively its results were examined. Even AOP won’t tell you with certainty that the results of the methods were actually checked in an assert.
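The stack-walking part can be sketched without any AOP framework. In the sketch below each method calls a probe by hand where an aspect would normally be woven in. It also treats any method whose name starts with "test" as the test case, which is my stand-in for checking for a TestCase subclass. TestDepthProbe, Inventory and the two test-like methods are all hypothetical.

```java
// Hypothetical probe: how many frames sit between the instrumented method and the test?
class TestDepthProbe {
    // Returns 1 if the instrumented method was called directly by a test method,
    // 2 if there is one method in between, and so on; -1 if no test is on the stack.
    static int depthFromTest() {
        StackTraceElement[] stack = new Throwable().getStackTrace();
        // stack[0] is this probe, stack[1] is the instrumented method that called it.
        for (int i = 2; i < stack.length; i++) {
            if (stack[i].getMethodName().startsWith("test")) {
                return i - 1;
            }
        }
        return -1;
    }
}

// Hypothetical system under test, instrumented by hand instead of by an aspect.
class Inventory {
    int lastRunDepth = -1;
    int lastLoadDepth = -1;

    void runLotsOfStuff() {
        lastRunDepth = TestDepthProbe.depthFromTest();
        loadItems();
    }

    void loadItems() {
        lastLoadDepth = TestDepthProbe.depthFromTest();
    }
}

class CallDepthSketch {
    // Calls runLotsOfStuff() directly; loadItems() is reached only indirectly.
    static int[] testRunLotsOfStuff() {
        Inventory inventory = new Inventory();
        inventory.runLotsOfStuff();
        return new int[] { inventory.lastRunDepth, inventory.lastLoadDepth };
    }

    // Calls loadItems() directly.
    static int testLoadItemsDirectly() {
        Inventory inventory = new Inventory();
        inventory.loadItems();
        return inventory.lastLoadDepth;
    }

    public static void main(String[] args) {
        int[] depths = testRunLotsOfStuff();
        System.out.println("runLotsOfStuff depth:   " + depths[0]);
        System.out.println("loadItems via run:      " + depths[1]);
        System.out.println("loadItems direct depth: " + testLoadItemsDirectly());
    }
}
```

A smaller depth means a more direct test. Dividing a method's call count by its depth would be one possible weighting, though as noted it still says nothing about whether the results were ever checked.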

This may be as far as you can practically go with measurement. Probably it’s as far as you should go. Ultimately, like any other metric, any test quality metric would be just a number that must be interpreted by someone with the experience and wisdom to know what the number means. Even great tools like PMD, Checkstyle, and FindBugs don’t put an overall quality rating on a piece of code.

Guess it’s time for the old trusty peer-review.
