I’ve never been good about writing tests for my code, or rather about writing tests as I’m coding. Whenever I do sit down to write tests I feel like it is taking too much time and providing only limited benefit. The only real benefit I have found from testing is that trying to write testable code almost always improves my design and implementation. So why do I find actually writing tests to be so difficult and frustrating? My guess is that the answer has two parts. First, I’m wasting my effort on low-value tests. Because I haven’t dedicated enough time to learning how to test, it is easy for me to fall into testing trivial functionality in isolation, when such trivialities would be better left to be exercised by more sophisticated tests. A good solution may be to focus on testing isolated units of functionality, rather than isolated units of implementation. Second, I haven’t written enough tests to truly become proficient with any testing framework other than pytest; this leads me to spend so much time trying to figure out the “right” way or the “best” way to implement my tests that I get stuck on trivialities like fixtures instead of focusing on designing and implementing the actual tests.
To kick off my education in writing better tests I’m going to read and understand the EUnit tests written by Fred Hebert for the ppool application in Learn You Some Erlang. I think this is a good test suite to look at because it falls nicely in the space of the sort of tests I should be writing all the time: it isn’t a tutorial on EUnit, but it was (I assume) written with novices in mind. It also isn’t an extremely rigorous test suite built for an important project, but rather just something used to get stuff done.
What gets tested?
- A pool server can be started and registered under a name.
My first reaction was that this test is unnecessary: checking that the server can be started seems like the kind of thing that is implicitly covered by all the other tests that rely on a running server. However, because the server must be registered under a name provided by the user, a test that checks the registration of the new server seems useful. Checking that the name is unregistered when the pool stops is also an important aspect of this test.
- Pool processes run the function provided at start time with the arguments provided at call time.
This test relies on the ppool_nagger module implemented to demonstrate the pool application. Since the pool requires an mfa() term to run, we have to give it a module. My inclination in the past has been to implement the API in the test module and pass the test module to the code under test. That can get pretty messy, but it does reduce the number of files that need to be kept in sync. I’m not sure what the best approach is for a “serious” application, but using a small demo module seems very reasonable for this quick example.
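For illustration, here is a minimal sketch of the kind of demo module this approach needs. echo_worker is a hypothetical, stripped-down stand-in for ppool_nagger that just pretends to work for a while and then reports back to the test process:

-module(echo_worker).
-export([start_link/3]).

%% The pool appends per-call arguments to the {M, F, A} given at pool start
%% (as simple_one_for_one supervisors do), so {echo_worker, start_link, []}
%% plus [Tag, Delay, ReportTo] becomes echo_worker:start_link(Tag, Delay, ReportTo).
start_link(Tag, Delay, ReportTo) ->
    Pid = spawn_link(fun() ->
              timer:sleep(Delay),          %% simulate Delay ms of work
              ReportTo ! {self(), Tag}     %% tell the test process we ran
          end),
    {ok, Pid}.

Assuming that argument-appending behavior, a pool would be started with ppool:start_pool(Name, Size, {echo_worker, start_link, []}) and given per-task arguments like [some_tag, 100, self()].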
- The pool accepts new requests to run workers only until the specified pool size is reached; after that, new requests are refused.
These two tests are a good example of testable code. Providing a synchronous API as well as an asynchronous API makes testing this simple. Of course the two APIs don’t just exist for testing, but writing this kind of test while designing would force you to consider what kind of feedback you are providing to your clients.
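As a hedged sketch of what that synchronous check looks like, assuming the ppool application has already been started by a setup fixture, that ppool:run/2 returns {ok, Pid} or noalloc when the pool is full (as I understand the LYSE implementation), that the nagger is started as ppool_nagger:start_link(Task, Delay, Max, SendTo), and that this lives in an EUnit test module:

refuse_when_full_test() ->
    ppool:start_pool(limit_pool, 2, {ppool_nagger, start_link, []}),
    {ok, _} = ppool:run(limit_pool, [first, 300, 1, self()]),
    {ok, _} = ppool:run(limit_pool, [second, 300, 1, self()]),
    %% both slots are taken, so the third request is refused
    ?assertEqual(noalloc, ppool:run(limit_pool, [third, 300, 1, self()])),
    ppool:stop_pool(limit_pool).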
- After a pool worker process in a fully allocated pool exits, a new process may be allocated.
Here the synchronous API is used again, but delays are added so some pool workers will finish before the third task is allocated.
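A sketch of that idea, under the same assumptions as the sketch above: the first two naggers only run for about 100 ms, so after a short wait a new allocation should succeed.

reallocate_after_exit_test() ->
    ppool:start_pool(realloc_pool, 2, {ppool_nagger, start_link, []}),
    {ok, _} = ppool:run(realloc_pool, [a, 100, 1, self()]),
    {ok, _} = ppool:run(realloc_pool, [b, 100, 1, self()]),
    ?assertEqual(noalloc, ppool:run(realloc_pool, [c, 100, 1, self()])),
    timer:sleep(200), %% give the first two workers time to finish and exit
    ?assertMatch({ok, _}, ppool:run(realloc_pool, [d, 100, 1, self()])),
    ppool:stop_pool(realloc_pool).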
- Requests to start workers can be queued if there is no worker available.
Yet again, the synchronous API makes testing this much simpler. In test_async_queue the synchronous ppool:run/2 function is used to assert that the first two asynchronous requests ran immediately because the pool was not full. Then, knowing the precondition for the test is satisfied (the pool is full), the final asynchronous call can be executed. If it succeeds, we know that it was queued, then popped off the queue and executed once one of the previous two workers exited.
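A sketch of that pattern, under the same assumptions as above (plus ppool:async_queue/2 returning ok immediately, and the nagger sending {Pid, Task} to SendTo when it fires): the noalloc reply from the synchronous call establishes that the pool is full, and the message from the third worker proves the queued request eventually ran.

queued_request_runs_test() ->
    ppool:start_pool(queue_pool, 2, {ppool_nagger, start_link, []}),
    %% fill the pool asynchronously with two workers that run for ~300 ms
    ok = ppool:async_queue(queue_pool, [first, 300, 1, self()]),
    ok = ppool:async_queue(queue_pool, [second, 300, 1, self()]),
    timer:sleep(50),
    %% a synchronous run now fails, confirming both async requests were
    %% allocated immediately and the pool is full
    ?assertEqual(noalloc, ppool:run(queue_pool, [blocked, 100, 1, self()])),
    %% the third request must be queued; it runs once a slot frees up
    ok = ppool:async_queue(queue_pool, [third, 100, 1, self()]),
    receive
        {_Pid, third} -> ok
    after 1000 ->
        erlang:error(queued_task_never_ran)
    end,
    ppool:stop_pool(queue_pool).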
The second part of this test doesn’t use the ppool:run/2 call to check the test preconditions, but those were already established as working in the previous test. Tasks are added through a mixture of the synchronous and asynchronous APIs, but not much can be said about whether they are working correctly or not. For example, the first synchronous call should return immediately, but the final one should block. I’m not sure how you could test that, other than by checking the time before and after the call and comparing it to the time required for the first synchronous call to return.
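One hedged way to check the blocking behaviour is to time the call, for example with timer:tc/1, and assert that it could not have returned before the running worker exited; same assumptions as the earlier sketches.

sync_queue_blocks_test() ->
    ppool:start_pool(block_pool, 1, {ppool_nagger, start_link, []}),
    {ok, _} = ppool:run(block_pool, [busy, 300, 1, self()]),
    {Micros, _Result} =
        timer:tc(fun() -> ppool:sync_queue(block_pool, [late, 1, 1, self()]) end),
    %% the call can only return after the ~300 ms worker has exited
    ?assert(Micros >= 250000),
    ppool:stop_pool(block_pool).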
Alternatively, you could submit jobs that would block the ppool longer than the default timeout for gen_server:call/2, then assert that the final ppool:sync_queue/2 call exits with reason timeout. This actually raises a great question about what the correct behavior should be under these circumstances: should the job run anyway? Since that corner case wasn’t tested, there was no need to define what should happen. The trade-off here is that the tests were quick and easy to develop; adding coverage for this edge case would definitely slow down development, and for this project it doesn’t really matter.
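A sketch of that alternative, again under the same assumptions, plus assuming ppool:sync_queue/2 is a plain gen_server:call/2 with the default 5000 ms timeout; note that it makes the test suite noticeably slower, which is exactly the trade-off described above.

sync_queue_timeout_test() ->
    ppool:start_pool(timeout_pool, 1, {ppool_nagger, start_link, []}),
    %% occupy the only slot for ~6 s, longer than the default call timeout
    {ok, _} = ppool:run(timeout_pool, [busy, 6000, 1, self()]),
    ?assertExit({timeout, _},
                ppool:sync_queue(timeout_pool, [late, 1, 1, self()])),
    ppool:stop_pool(timeout_pool).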
- All workers terminate when the pool is stopped, even if they are trapping exits.
This is pretty straightforward (schedule something that takes a long time, then check that it is dead before that time is up), but it isn’t clear without reviewing the ppool_nagger module that it tests the “even if they are trapping exits” part of the requirement.
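For context, “trapping exits” just means the worker sets the trap_exit process flag, typically in its init, so that ordinary exit signals arrive as messages instead of killing it. A purely illustrative sketch (not a quote from ppool_nagger):

init(Args) ->
    %% exit signals now arrive as {'EXIT', From, Reason} messages, so only
    %% an untrappable exit(Pid, kill) can take the process down directly
    process_flag(trap_exit, true),
    {ok, Args}.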
- Dead children are never restarted. (This is grouped into the same fixture as 6, but seems logically distinct to me.)
It isn’t clear to me that this is actually tested. To do this, you would need to kill the child (exit(Pid, die)) and then verify that it wasn’t restarted. One way would be to fill the pool with long-running tasks, kill one, then after a short timeout (shorter than any task previously added to the queue) assert that {ok, Pid} = ppool:run(...).
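That check might look something like this sketch, under the same assumptions as the earlier examples; if the dead child had been restarted, the pool would still be full and the final run would return noalloc.

no_restart_test() ->
    ppool:start_pool(norestart_pool, 2, {ppool_nagger, start_link, []}),
    {ok, Pid1} = ppool:run(norestart_pool, [a, 5000, 1, self()]),
    {ok, _Pid2} = ppool:run(norestart_pool, [b, 5000, 1, self()]),
    exit(Pid1, die),  %% use exit(Pid1, kill) instead if the worker traps exits
    timer:sleep(100),
    %% the freed slot can be reused, so the dead child was not restarted into it
    ?assertMatch({ok, _}, ppool:run(norestart_pool, [c, 5000, 1, self()])),
    ppool:stop_pool(norestart_pool).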
- Only 'DOWN' signals from workers cause a new task to be dequeued; signals from any other process do not.
Here a bunch of 'DOWN' messages are sent with made-up references, and the test verifies that the expected output from all the tasks is still received. The timing in this test is pretty subtle though. To check that none of the workers were deallocated, work is added to the queue with delays arranged so that one short task is scheduled first, then two long tasks block the queue and prevent an almost-immediate task from being executed. A delay is introduced in the test process to let the first task finish before trying to kill the remaining tasks. After this, some time is given for signals to arrive at the test process, and the messages are checked to ensure that only the first task actually ran.
I think this test could be improved by changing the output from the workers so each one returned something unique. That way, in the ?_assertMatch(...) expression it would be clear that the only message we should expect to see originates from the first task.
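For example (assuming, as before, that each worker reports {Pid, Task} to the test process), the workers could be tagged short_task, long_one, long_two and almost_immediate, and the final assertion could collect whatever arrived and pin it down exactly:

flush_tasks() ->
    receive
        {_Pid, Task} -> [Task | flush_tasks()]
    after 0 ->
        []
    end.

%% ... start the pool, queue the four tagged workers, send the fake 'DOWN'
%% messages and wait, as in the original test, then finish with:
%%     [?_assertEqual([short_task], flush_tasks())]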
Some final thoughts
It is very common to see calls to timer:sleep/1 in Erlang tests. This is just the nature of testing concurrent programs; if you are trying to assert that some other process has done something, then you must give the other process time to do it. By taking advantage of massive concurrency to run the test suite (potentially) in parallel, these small delays (100-600 ms) probably don’t add up to much in the overall running time for the test suite. They bother me for another reason though: they create the potential for nondeterministic behavior in the tests. For every call to timer:sleep/1 there is a chance, however unlikely, that the event we are waiting for won’t happen in that time. For a test like start_and_test_name, inserting a delay to wait for a process to terminate seems unnecessary when we can instead rely on monitors to know when it is safe to proceed to the next part of the test. Instead of this:
start_and_test_name(Name) ->
    ppool:start_pool(Name, 1, {ppool_nagger, start_link, []}),
    A = whereis(Name),
    ppool:stop_pool(Name),
    timer:sleep(100),
    B = whereis(Name),
    [?_assert(undefined =/= A),
     ?_assertEqual(undefined, B)].
we could have written this:
start_and_test_name(Name) ->
    ppool:start_pool(Name, 1, {ppool_nagger, start_link, []}),
    A = whereis(Name),
    Ref = monitor(process, Name),
    ppool:stop_pool(Name),
    receive {'DOWN', Ref, process, _Pid, _Reason} -> ok end,
    B = whereis(Name),
    [?_assert(undefined =/= A),
     ?_assertEqual(undefined, B)].
To me that seems clearer. It will never be possible to eliminate all calls to timer:sleep/1, but wherever possible it is nice to make it clear in the code that we are waiting for a specific event before proceeding with the test.
How do you know how deep to test?
As mentioned in the discussion of test 5, we could test for an interesting corner case, but it doesn’t really matter for this project. That raises the question, “how far should you go?” My work has grappled with this frequently. I build a lot of “research software” where the software isn’t really the end product; the papers and reports are. There is rarely an expectation that code will be reused or even applied to different data sets in the future, let alone maintained, so deep tests often seem like overkill. Institutionally we have tried to define a tiered model of software quality where low-risk/low-exposure projects don’t need testing, and as the risk and exposure increase the required testing increases as well. I haven’t seen many projects move from the low to the high category, but I do wonder how difficult it would be to tack on testing as an afterthought to code that wasn’t designed with testability in mind from the beginning. Maybe that means the low-risk version of a project is best treated as an elaborate prototype and discarded when the project moves up to a higher tier.