Random numbers are hard to work with and they are harder to debug. In this post I will demonstrate the testing of a shuffle function in Python. We first implement an unbiased version of Knuth-Fisher-Yates shuffle and then implement two biased ones. I will then use the functional properties of python to use the same tester for all three .

**What is the Knuth-Fisher-Yates shuffling? ** This shuffling is the correct method for shuffling a list or array. In this method you have a list and you want to shuffle it such that each element is equally likely to appear in any cell. Here is how it works, you have an iterator that goes through the elements of the list one by one, every time you pick an element you generate a random number that is between the position of that element and the position of the last element (inclusive), then you swap the two elements. It is easy to see that each element is equally likely to be appearing in any position in the final list. Below is the correct implementation of the Knuth shuffle in Python.

1 2 3 4 5 6 7 8 |
from random import randrange def shuf(arr): # Correct Knuth, Fisher-Yates Shuffle for i in range(0, len(arr)): pos = randrange(i, len(arr)) arr[i], arr[pos] = arr[pos], arr[i] return arr |

Allowing the element to be able to be swapped by itself is important. So is not allowing it to be swapped by anything before it. It is, in fact, relatively easy to come up with biased implementations of this function. I have implemented two methods that might look correct at the first look but they are biased and should be avoided. The first implementation takes the element but swaps it with any arbitrary element in the list. The second biased implementation does not allow the element to be swapped by itself.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
from random import randrange def shuf_incorrect1(arr): for i in range(0, len(arr)): pos = randrange(0, len(arr)) arr[i], arr[pos] = arr[pos], arr[i] return arr def shuf_incorrect2(arr): for i in range(0, len(arr)): if i < len(arr)-1: pos = randrange(i+1, len(arr)) arr[i], arr[pos] = arr[pos], arr[i] return arr |

**How can we see this bias and how can we test these functions?**

Testing functions that use random number generators are hard. One way of testing these three functions is to look at the probability of each element appearing as the first element of the shuffled list. Ideally we expect the elements to be uniformly distributed in the shuffled list. The tester uses the functional nature of python and here we pass a function to our functions to test.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
def test_shuf(f): sims = 10000 counts = [0]*11 for sim in range(0, sims): x = range(0,11) y = f(x) counts[y[0]] = counts[y[0]]+ 1 counts = map(lambda l: (l*1.00)/float(sims), counts) return counts print "Biased Shuffle 1: ", test_shuf(shuf_incorrect1) print "Biased Shuffle 2: ", test_shuf(shuf_incorrect2) print "Correct Implementation: ", test_shuf(shuf) |

Output of this shows that the probabilities of any element appearing in the first position is only close to each other in the correct implementation and any of the biased implementations shows a nonuniform distribution.

1 2 3 4 |
Biased Shuffle 1: [0.090739, 0.117495, 0.109723, 0.103565, 0.097536, 0.091024, 0.08627, 0.08228, 0.077344, 0.073919, 0.070105] Biased Shuffle 2: [0.0, 0.100086, 0.100461, 0.099916, 0.099673, 0.100153, 0.100346, 0.099756, 0.100055, 0.099927, 0.099627] Correct Implementation: [0.090901, 0.090712, 0.090912, 0.09116, 0.090953, 0.091006, 0.090874, 0.090572, 0.0912, 0.090712, 0.090998] [Finished in 41.3s] |

The complete code is below:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 |
from random import randrange def shuf(arr): # Correct Knuth, Fisher-Yates Shuffle for i in range(0, len(arr)): pos = randrange(i, len(arr)) arr[i], arr[pos] = arr[pos], arr[i] return arr def shuf_incorrect1(arr): for i in range(0, len(arr)): pos = randrange(0, len(arr)) arr[i], arr[pos] = arr[pos], arr[i] return arr def shuf_incorrect2(arr): for i in range(0, len(arr)): if i < len(arr)-1: pos = randrange(i+1, len(arr)) arr[i], arr[pos] = arr[pos], arr[i] return arr def test_shuf(f): sims = 1000000 counts = [0]*11 for sim in range(0, sims): x = range(0,11) y = f(x) counts[y[0]] = counts[y[0]]+ 1 counts = map(lambda l: (l*1.00)/float(sims), counts) return counts print "Biased Shuffle 1: ", test_shuf(shuf_incorrect1) print "Biased Shuffle 2: ", test_shuf(shuf_incorrect2) print "Correct Implementation: ", test_shuf(shuf) |

See the follow up post here

Pingback: How to unit test randomized algorithms using information entropy and t-test – data science tools and techniques