# How to unit test randomized algorithms using information entropy and t-test

In the previous post we looked at how the Knuth-Fisher-Yates shuffle works and how we can test it in a very quick way. We implemented the correct version and two incorrect versions and to test them we looked at the distribution of the elements in the first cell for our list to make sure it is uniform. The output of the three functions were as below:

It is relatively easy to see that the biased shuffle 2 is very different than the other two. But how can we really differentiate between the biased shuffle 1 and the correct implementation? They both seem to return uniform outputs. Besides, how can we come up with automated unit tests that we can run and stop the incorrect implementations from being used?

Using Maximum Entropy:

One way of testing these two implementations is to use the maximum entropy principle. The information entropy is defined as $H = \sum p_i \log_2 p_i$. Based on this definition, a more uniform distribution will have a higher entropy value than a non uniform distribution and spiky distribution and we use this property to compare the two implementations.

In the following code we print the average entropy of the output of four different functions: a biased implementation that looks actually pretty good if we naively look at its output, a correct implementation by us, Python’s internal implementation, and finally an ideal uniform output.

The output of this code is as below:

The incorrect implementation in fact has the smallest entropy, suggesting that it is biased. Our correct implementation has a higher entropy than Python’s internal shuffle function, does that mean we are better than the internal shuffle? Besides, if we did not have the correct implementation, the entropy of the incorrect implementation would have looks close enough to the entropy of the python shuffle. In this case we can use t-statistics and p-value of a t-test.

Using t-test for unit testing:
The Welch’s t-test or the Student t-test can help us see if the output of our shuffle functions are different from each other. We run each shuffle a couple of times and then run a t-test between the outputs. If the p-value is smaller than some threshold then the outputs are different. Below is the python and scipy codes for doing that:

The output of this piece of code is as follows. As we see the p-value of the t-test between Python shuffle and our own correct shuffle is above the 0.05 threshold but when we do a t-test between our biased implementation and the correct shuffle it is statistically significant highlighting the entropy value for our biased algorithm is in fact smaller than the correct python implementation.

Conclusion:
I have found t-test to be a very interesting way to unit test randomized functions. In this post I use the information entropy and the t-test procedure to unit test a shuffle function. However, you would need to have at least 100 correct outputs from a correct and unbiased shuffle function to be able to run your t-test otherwise your t-test will not have enough power to detect bias. Using t-test has its own drawbacks though. Every 20 or so times that you run your tests, even the correct implementation will not pass the test since p-value might randomly be below 0.05 every 20 times when there is no true difference. The following the the whole script that I used here. I increased the degrees of freedom (the number of times we run the shuffle functions) to 10000 and the p-value still was not statistical significance showing a good behavior in our case.