In the previous post we looked at how the Knuth-Fisher-Yates shuffle works and how we can test it in a very quick way. We implemented the correct version and two incorrect versions and to test them we looked at the distribution of the elements in the first cell for our list to make sure it is uniform. The output of the three functions were as below:

1 2 3 |
Biased Shuffle 1: [0.090739, 0.117495, 0.109723, 0.103565, 0.097536, 0.091024, 0.08627, 0.08228, 0.077344, 0.073919, 0.070105] Biased Shuffle 2: [0.0, 0.100086, 0.100461, 0.099916, 0.099673, 0.100153, 0.100346, 0.099756, 0.100055, 0.099927, 0.099627] Correct Implementation: [0.090901, 0.090712, 0.090912, 0.09116, 0.090953, 0.091006, 0.090874, 0.090572, 0.0912, 0.090712, 0.090998] |

It is relatively easy to see that the biased shuffle 2 is very different than the other two. But how can we really differentiate between the biased shuffle 1 and the correct implementation? They both seem to return uniform outputs. Besides, how can we come up with automated unit tests that we can run and stop the incorrect implementations from being used?

**Using Maximum Entropy: **

One way of testing these two implementations is to use the maximum entropy principle. The information entropy is defined as . Based on this definition, a more uniform distribution will have a higher entropy value than a non uniform distribution and spiky distribution and we use this property to compare the two implementations.

In the following code we print the average entropy of the output of four different functions: a biased implementation that looks actually pretty good if we naively look at its output, a correct implementation by us, Python’s internal implementation, and finally an ideal uniform output.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 |
from random import randrange, shuffle from math import log from scipy import stats, average def shuf(arr): # Correct Knuth, Fisher-Yates Shuffle for i in range(0, len(arr)): pos = randrange(i, len(arr)) arr[i], arr[pos] = arr[pos], arr[i] return arr def python_shuf(arr): # The internal implementaion of the shuffle shuffle(arr) return arr def shuf_incorrect(arr): # The biased implementation of the algorithm for i in range(0, len(arr)): pos = randrange(0, len(arr)) arr[i], arr[pos] = arr[pos], arr[i] return arr def test_shuf(f, sims = 1000): # tester function counts = [0]*11 for sim in range(0, sims): x = range(0,11) y = f(x) counts[y[0]] = counts[y[0]]+ 1 counts = map(lambda l: (l*1.00)/float(sims), counts) return counts def entropy(arr): H = 0 for i in arr: if i == 0: return -float('Inf') H -= i*log(i,2) return H if __name__=="__main__": df = 100 sims = 2000 entropy_vals_incorrect = [entropy(test_shuf(shuf_incorrect, sims = sims)) for i in range(0, df)] entropy_vals_correct = [entropy(test_shuf(shuf, sims = sims)) for i in range(0, df)] entropy_vals_unif = [entropy([1.0/11]*11) for i in range(0, df)] entropy_vals_pythonshuf = [entropy(test_shuf(python_shuf, sims = sims)) for i in range(0, df)] print "__________ Entropy Values _____________" print "Entropy of the correct implementation: ", average(entropy_vals_correct) print "Entropy of the incorrect implementation: ", average(entropy_vals_incorrect) print "Entropy of a uniform distribution: ", average(entropy_vals_unif) print "Entropy of the python shuffle: ",average(entropy_vals_pythonshuf) |

The output of this code is as below:

1 2 3 4 5 |
__________ Entropy Values _____________ Entropy of the correct implementation: 3.45598144977 Entropy of the incorrect implementation: 3.43787539846 Entropy of a uniform distribution: 3.45943161864 Entropy of the python shuffle: 3.45588513305 |

The incorrect implementation in fact has the smallest entropy, suggesting that it is biased. Our correct implementation has a higher entropy than Python’s internal shuffle function, does that mean we are better than the internal shuffle? Besides, if we did not have the correct implementation, the entropy of the incorrect implementation would have looks close enough to the entropy of the python shuffle. In this case we can use t-statistics and p-value of a t-test.

**Using t-test for unit testing:**

The Welch’s t-test or the Student t-test can help us see if the output of our shuffle functions are different from each other. We run each shuffle a couple of times and then run a t-test between the outputs. If the p-value is smaller than some threshold then the outputs are different. Below is the python and scipy codes for doing that:

1 2 3 |
print "__________ Testing for Equal Entropies _____________" print "Knuth vs. Python Shuffle: pval = ", stats.ttest_ind(entropy_vals_correct, entropy_vals_pythonshuf, equal_var = False)[1] print "Biased implementation vs. Python Shuffle: pval = ",stats.ttest_ind(entropy_vals_incorrect, entropy_vals_pythonshuf, equal_var = False)[1] |

The output of this piece of code is as follows. As we see the p-value of the t-test between Python shuffle and our own correct shuffle is above the 0.05 threshold but when we do a t-test between our biased implementation and the correct shuffle it is statistically significant highlighting the entropy value for our biased algorithm is in fact smaller than the correct python implementation.

1 2 3 |
__________ Testing for Equal Entropies _____________ Knuth vs. Python Shuffle: pval = 0.647361013665 Biased implementation vs. Python Shuffle: pval = 1.04934104407e-64 |

**Conclusion:**

I have found t-test to be a very interesting way to unit test randomized functions. In this post I use the information entropy and the t-test procedure to unit test a shuffle function. However, you would need to have at least 100 correct outputs from a correct and unbiased shuffle function to be able to run your t-test otherwise your t-test will not have enough power to detect bias. Using t-test has its own drawbacks though. Every 20 or so times that you run your tests, even the correct implementation will not pass the test since p-value might randomly be below 0.05 every 20 times when there is no true difference. The following the the whole script that I used here. I increased the degrees of freedom (the number of times we run the shuffle functions) to 10000 and the p-value still was not statistical significance showing a good behavior in our case.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 |
from random import randrange, shuffle from math import log from scipy import stats, average def shuf(arr): # Correct Knuth, Fisher-Yates Shuffle for i in range(0, len(arr)): pos = randrange(i, len(arr)) arr[i], arr[pos] = arr[pos], arr[i] return arr def python_shuf(arr): # The internal implementaion of the shuffle shuffle(arr) return arr def shuf_incorrect(arr): # The biased implementation of the algorithm for i in range(0, len(arr)): pos = randrange(0, len(arr)) arr[i], arr[pos] = arr[pos], arr[i] return arr def test_shuf(f, sims = 1000): # tester function counts = [0]*11 for sim in range(0, sims): x = range(0,11) y = f(x) counts[y[0]] = counts[y[0]]+ 1 counts = map(lambda l: (l*1.00)/float(sims), counts) return counts def entropy(arr): H = 0 for i in arr: if i == 0: return -float('Inf') H -= i*log(i,2) return H if __name__=="__main__": df = 10000 sims = 200 entropy_vals_incorrect = [entropy(test_shuf(shuf_incorrect, sims = sims)) for i in range(0, df)] entropy_vals_correct = [entropy(test_shuf(shuf, sims = sims)) for i in range(0, df)] entropy_vals_unif = [entropy([1.0/11]*11) for i in range(0, df)] entropy_vals_pythonshuf = [entropy(test_shuf(python_shuf, sims = sims)) for i in range(0, df)] print "__________ Entropy Values _____________" print "Entropy of the correct implementation: ", average(entropy_vals_correct) print "Entropy of the incorrect implementation: ", average(entropy_vals_incorrect) print "Entropy of a uniform distribution: ", average(entropy_vals_unif) print "Entropy of the python shuffle: ",average(entropy_vals_pythonshuf) print "__________ Testing for Equal Entropies _____________" print "Knuth vs. Python Shuffle: pval = ", stats.ttest_ind(entropy_vals_correct, entropy_vals_pythonshuf, equal_var = False)[1] print "Biased implementation vs. Python Shuffle: pval = ",stats.ttest_ind(entropy_vals_incorrect, entropy_vals_pythonshuf, equal_var = False)[1] |

1 2 3 4 5 6 7 8 |
__________ Entropy Values _____________ Entropy of the correct implementation: 3.42292527516 Entropy of the incorrect implementation: 3.40437341717 Entropy of a uniform distribution: 3.45943161864 Entropy of the python shuffle: 3.42290675337 __________ Testing for Equal Entropies _____________ Knuth vs. Python Shuffle: pval = 0.935978596364 Biased implementation vs. Python Shuffle: pval = 0.0 |