How to unit test randomized algorithms using information entropy and the t-test

In the previous post we looked at how the Knuth-Fisher-Yates shuffle works and how we can test it in a very quick way. We implemented the correct version and two incorrect versions, and to test them we looked at the distribution of the elements in the first cell of our list to make sure it is uniform. The outputs of the three functions were as below:

It is relatively easy to see that the biased shuffle 2 is very different from the other two. But how can we really differentiate between the biased shuffle 1 and the correct implementation? They both seem to return uniform outputs. Besides, how can we come up with automated unit tests that we can run to stop the incorrect implementations from being used?

Using Maximum Entropy:

One way of testing these two implementations is to use the maximum entropy principle. The information entropy is defined as H = -\sum_i p_i \log_2 p_i . Based on this definition, a uniform distribution has a higher entropy value than a non-uniform, spiky distribution, and we use this property to compare the two implementations.

In the following code we print the average entropy of the output of four different functions: a biased implementation that looks actually pretty good if we naively look at its output, a correct implementation by us, Python’s internal implementation, and finally an ideal uniform output.
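The original code is not reproduced here; the following is a minimal sketch of the idea. The biased_shuffle below is a hypothetical stand-in for the biased implementation from the previous post (it makes a common Fisher-Yates mistake: swapping each position with any position rather than only the remaining ones), and the list size of 3 is an assumption for illustration.

```python
import math
import random

def biased_shuffle(items):
    # A common Fisher-Yates mistake: swap position i with ANY position
    # (hypothetical stand-in for the biased implementation in the post)
    items = list(items)
    n = len(items)
    for i in range(n):
        j = random.randrange(n)
        items[i], items[j] = items[j], items[i]
    return items

def correct_shuffle(items):
    # Knuth-Fisher-Yates: swap position i with a position in 0..i
    items = list(items)
    for i in range(len(items) - 1, 0, -1):
        j = random.randrange(i + 1)
        items[i], items[j] = items[j], items[i]
    return items

def python_shuffle(items):
    # wrapper around Python's internal shuffle (shuffles in place)
    items = list(items)
    random.shuffle(items)
    return items

def first_cell_entropy(shuffle_fn, n=3, trials=200_000):
    # Shannon entropy (bits) of the distribution of the first cell
    counts = [0] * n
    for _ in range(trials):
        counts[shuffle_fn(range(n))[0]] += 1
    return -sum(c / trials * math.log2(c / trials) for c in counts if c)

random.seed(42)
entropies = {fn.__name__: first_cell_entropy(fn)
             for fn in (biased_shuffle, correct_shuffle, python_shuffle)}
for name, h in entropies.items():
    print(name, round(h, 4))
print("ideal uniform", round(math.log2(3), 4))
```

The ideal uniform value here is log2(3) ≈ 1.585 bits, and the biased shuffle's measured entropy falls slightly below it.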

The output of this code is as below:

The incorrect implementation in fact has the smallest entropy, suggesting that it is biased. Our correct implementation has a higher measured entropy than Python's internal shuffle function; does that mean we are better than the internal shuffle? No, the difference is just sampling noise. Besides, if we did not have the correct implementation, the entropy of the incorrect implementation would have looked close enough to the entropy of the Python shuffle. In such cases we can use the t-statistic and the p-value of a t-test.

Using t-test for unit testing:
Welch’s t-test or Student's t-test can help us see whether the outputs of our shuffle functions differ from each other. We run each shuffle a number of times, compute the entropy of each run, and then run a t-test between the two sets of values. If the p-value is smaller than some threshold, the outputs are different. Below is the Python and SciPy code for doing that:
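The original snippet used SciPy's ttest_ind; here is a stdlib-only sketch of the same idea. It implements Welch's t-statistic directly, takes the two-sided p-value from the normal approximation (adequate for large-ish samples), and, as an illustration, applies it to synthetic stand-ins for repeated entropy measurements (Gaussian samples near log2(3) ≈ 1.585, an assumption on my part):

```python
import math
import random
import statistics

def welch_t_test(a, b):
    # Welch's two-sample t statistic; two-sided p-value via the
    # normal approximation 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2))
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    t = (ma - mb) / math.sqrt(va / len(a) + vb / len(b))
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p

random.seed(0)
# Synthetic entropy samples: a correct shuffle near log2(3) ~ 1.585
# and a biased one with slightly lower entropy
correct = [random.gauss(1.585, 0.002) for _ in range(100)]
biased = [random.gauss(1.579, 0.002) for _ in range(100)]

t, p = welch_t_test(correct, biased)
print(t, p)  # large t, p far below 0.05: the outputs differ
```

With real measurements you would replace the synthetic lists with the entropy values computed from each shuffle implementation.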

The output of this piece of code is as follows. As we can see, the p-value of the t-test between Python's shuffle and our own correct shuffle is above the 0.05 threshold, but the t-test between our biased implementation and the correct shuffle is statistically significant, indicating that the entropy of our biased algorithm is in fact smaller than that of the correct Python implementation.

I have found the t-test to be a very interesting way to unit test randomized functions. In this post I used information entropy and the t-test procedure to unit test a shuffle function. Note that you would need at least 100 or so outputs from a correct and unbiased shuffle function to be able to run your t-test; otherwise the test will not have enough power to detect bias. Using a t-test has its own drawbacks, though. Roughly one in every twenty runs, even the correct implementation will fail the test, since the p-value falls below 0.05 about 5% of the time even when there is no true difference. The following is the whole script that I used here. I increased the sample size (the number of times we run the shuffle functions) to 10,000 and the p-value still did not reach statistical significance, showing the behavior we want in this case.

The absolute value of the normal distribution
|N(μ, σ²)|

Let’s start with an interesting question. If I split a normal distribution in half and find the center for each half and then subtract the left center from the center of the right half what will I get?


Surprisingly, I have run into this problem twice in my daily work and I thought it would make an interesting blog post. The first time I had to solve this problem was when I was implementing this paper. The analytical form of this solution is relatively easy to derive, but we will end up doing some simplifications.

\int_{0}^{\infty} x . \tfrac{2}{\sigma \sqrt{2 \pi }}e^{- \tfrac{(x-\mu )^2}{2\sigma ^2}} dx - \int_{- \infty}^{0} x . \tfrac{2}{\sigma \sqrt{2 \pi }}e^{- \tfrac{(x-\mu )^2}{2\sigma ^2}} dx

The 2 in the numerator of \tfrac{2}{\sigma \sqrt{2 \pi }} is there to transform the half-normal density into a proper probability distribution that integrates to one. The expression can now be simplified to

= 4 \int_{0}^{\infty} x . \tfrac{1}{\sigma \sqrt{2 \pi }}e^{- \tfrac{(x-\mu )^2}{2\sigma ^2}} dx

\mu and \sigma are constant parameters, therefore

= 4 . \tfrac{1}{\sigma \sqrt{2 \pi }} \int_{0}^{\infty} x e^{- \tfrac{(x-\mu )^2}{2\sigma ^2}} dx

There is no elementary closed form for this integral for a general \mu (you would end up with the error function). However, we can easily see that \mu is not going to change the result: the expression has the same value for any arbitrary \mu, because we can shift the two half-normal distributions along the x axis so that the final normal distribution is zero-mean. For a standard normal (\mu = 0, \sigma = 1), we arrive at the following expression, which is easily solvable.

= 4 . \tfrac{1}{\sqrt{2 \pi }} \int_{0}^{\infty} x e^{- \tfrac{x^2}{2}} dx

This expression will result in

= 2 \sqrt{\tfrac{2}{\pi}}\left [ -e^{\tfrac{-x^2}{2}} \right ]_{0}^{\infty} = 2 \sqrt{\tfrac{2}{\pi}}

It is therefore easy to show that for any other normal distribution with standard deviation σ, the expression for the mean is the following:

= 2 \sigma \sqrt{\tfrac{2}{\pi}}

I usually evaluate my analytical solution with a little simulation code. The following code, although not optimized for performance, can show us that the analytical solution is correct.
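The original simulation code is not shown; a quick Python sketch of the same check might look like this:

```python
import random

random.seed(1)
samples = [random.gauss(0, 1) for _ in range(1_000_000)]
right = [x for x in samples if x > 0]
left = [x for x in samples if x <= 0]
# center of the right half minus center of the left half
diff = sum(right) / len(right) - sum(left) / len(left)
print(diff)  # should be close to 2 * sqrt(2 / pi) ≈ 1.595769
```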


The output of this little code is as follows, and it is sufficiently close to the theoretical value of 1.595769.

The absolute value of the normal distribution follows a distribution called the half-normal distribution. Using the same method, it is easy to show that the mean value for the half-normal distribution is

=  \sigma \sqrt{\tfrac{2}{\pi}}

How do you interpret Benford’s Law? A practical technique for detecting instrumentation and logging problems

Benford’s Law is a mesmerizing statistical phenomenon. It says that if you take any count statistic (like the populations of cities in the world), extract the first (left-most) digit, and calculate the distribution of these digits, then they are distributed according to log10(1+1/d), in which d is the digit.

It is pretty powerful and extremely practical, and it has mainly been used to detect fraud. For example, you can detect election fraud by applying it to the number of votes in each ballot box. I sometimes use it to check whether our data logging is done correctly. It turns out that if your logger skips records and does not write some of your data to the logs, Benford’s Law breaks, and you can detect the logging problem by applying the law to your data (without even examining your logger or instrumentation logic).

My interpretation of the Benford’s law
Surprisingly, the reason why Benford’s Law works is still very much a mystery. The way I look at it is by imagining a counter and a Poisson clock. The counter starts from 1 and goes up by 1 at every tick; however, the Poisson clock is rigged: instead of ticking according to an exponential random variable with a fixed rate λ, it ticks with a new random λ each time. When the clock ticks, we take the counter value, record its first digit, and reset the counter to 1 again.

The code that comes below implements this logic very efficiently, and the result agrees well with Benford’s Law.
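The original code is not reproduced here; below is a sketch of a continuous analogue of the rigged clock, where the counter value at each tick is approximated by the exponential waiting time itself, and where the rate λ is re-drawn log-uniformly over several orders of magnitude (that spread is an assumption on my part):

```python
import math
import random

random.seed(7)
N = 200_000
counts = [0] * 10
for _ in range(N):
    rate = 10 ** -random.uniform(0, 6)   # a fresh, log-uniform rate each tick
    value = random.expovariate(rate)     # waiting time ~ counter value at the tick
    counts[int(f"{value:e}"[0])] += 1    # first significant digit

for d in range(1, 10):
    print(d, round(counts[d] / N, 4), round(math.log10(1 + 1 / d), 4))
```

The second and third columns (observed frequency versus log10(1+1/d)) come out very close to each other.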

The interesting thing about Benford’s Law is that many different generative processes produce results that agree with it, just as my Poisson process does. I am now wondering: what other processes can be used to generate a sequence of numbers that follows Benford’s Law?

What is the probability of having 3 double-yolk eggs in a dozen?

A friend of mine was making an omelet and found 3 double-yolk eggs in the box. What is the probability of that happening? (assume only 1 out of every 100 eggs is double-yolk)

Answer: 0.000205616 (probability of seeing 3 or more double-yolk eggs in a box)


Let’s make this question more interesting by changing it to: “what is the probability of having at least 2 double-yolk eggs in a box of a dozen?”. For this, it is in fact easier to find the probability of having only 1 or no double-yolk eggs in the box and then calculate one minus that probability.

1 - Prob(exactly \nobreakspace one \nobreakspace double \nobreakspace yolk \nobreakspace egg) - Prob(no \nobreakspace double \nobreakspace yolk \nobreakspace eggs)


In the general form, if we want to have at least a minimum number of double-yolk eggs we can arrive at the following formula using the Binomial distribution:

1 - \sum_{i=0}^{k-1} [P^i \times (1-P)^{(n - i)} \times {n \choose i} ]

where n is the box size and k is the minimum number of double-yolk eggs.


If you solve this you will arrive at 0.000205616 as the answer.
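A quick sanity check of the formula in Python, with a box of n = 12 eggs, P = 0.01, and a minimum of k = 3 double-yolks:

```python
from math import comb

def prob_at_least(k, n=12, p=0.01):
    # P(at least k double-yolk eggs in a box of n):
    # one minus the Binomial CDF up to k - 1
    return 1 - sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k))

print(prob_at_least(3))  # ≈ 0.000205616
```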

To be sure that this formula is correct I wrote a little C++ program that simulates this process. The results from the theoretical formula match the simulation results.
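The original C++ simulation is not included; an equivalent sketch in Python:

```python
import random

random.seed(123)
trials = 1_000_000
hits = 0
for _ in range(trials):
    # a dozen eggs, each double-yolk with probability 1/100
    double_yolks = sum(random.random() < 0.01 for _ in range(12))
    if double_yolks >= 3:
        hits += 1
print(hits / trials)  # should be in the vicinity of 0.000205616
```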

Four different methods for solving a very simple problem: Expected number of coin tosses to get one tail!

What is the expected number of times that you need to throw a biased coin to get the first tail?

This is an interesting question. Let’s assume the probability of getting a tail is p, how many times do we need to toss the coin to get the tail? Let’s solve it using a couple of different methods.

Solution 1: Using the definition of expectation

This is actually the hardest solution, simply because with this method you arrive at a series that is not easy to sum. Regardless, let's use the definition of expectation. By definition, the expected value is as follows:

E[# tosses until a tail] = 1·(prob of a tail on the first toss) + 2·(prob of a head, then a tail) + 3·(prob of two heads, then a tail) + … + n·(prob of n−1 heads, then a tail) + …

This value can be written as

E[X] = 1 \cdot p + 2(1-p)p + 3(1-p)^2 p + \cdots + n(1-p)^{n-1} p + \cdots

This can be simplified to the following

E[X] = p \sum_{n=1}^\infty n(1-p)^{n-1}

The sum is equal to \frac{1}{p^2} (you can either use Wolfram Alpha or use this theorem), therefore E[X] = \frac{1}{p}.

Solution 2: Using recurrence

There is a little trick that makes your life much easier. You might notice that the process is memoryless: at any point it is independent of what has happened previously. That means if you toss the coin and get a head, the expected number of additional tosses to get a tail is still E[X]. Therefore:

E[X] = 1 \cdot p + (1-p)(1 + E[X])

if we solve for E[X] we will have:

E[X] = \frac{1}{p}

This way you avoid solving the series \sum_{n=1}^\infty n(1-p)^{n-1} that we had to deal with in the previous solution.

Solution 3: Using geometric random variables 

This is probably the easiest solution. The number of tosses until one tail follows a geometric distribution and the mean for geometric distribution is 1/p. (Also see page 40 of Ross’s Introduction to Probability Models)

Solution 4: Using simulation 

This is easy to solve using simulation. The following is the R code for calculating this expected value.

And the output will be

If you prefer Python over R then the following is the most efficient implementation that I can come up with
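The original snippet is not reproduced here; a straightforward (if not maximally efficient) stand-in for the simulation, using an assumed P(tail) = 0.3:

```python
import random

def tosses_until_tail(p):
    # toss a biased coin (P(tail) = p) until the first tail appears
    n = 1
    while random.random() >= p:
        n += 1
    return n

random.seed(0)
p = 0.3
trials = 200_000
average = sum(tosses_until_tail(p) for _ in range(trials)) / trials
print(average)  # should be close to 1 / p ≈ 3.33
```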


Now can you solve this problem: “find the expected number of coin tosses to get two consecutive tails”


Top-down recursion, dynamic programming and memoization in Python

I was talking to a friend about dynamic programming and I realized his understanding of it is basically converting a recursive function to an iterative function that calculates all the values up to the value we are interested in. While this definition is not incorrect, it only covers a subcategory of dynamic programming called the “bottom-up” approach.

Contrary to my friend’s belief, a dynamic programming solution can be recursive. In this approach you calculate previous values only when you need them, and only if you have not calculated them already. This lazy style of dynamic programming is called the “top-down” approach. In a generic recursive solution, after you calculate the value of f(n-1) you throw it away; the next time you need f(n-1), you have to recalculate it.

The top-down dynamic programming approach is a combination of recursion and memoization. Memoization gives you a lookup table of f(x) values: as soon as you calculate f(n-1), you enter n-1 into a hash table (i.e., a Python dictionary) as the key, with f(n-1) as the value. The next time you need f(n-1), you just look it up in your dictionary in O(1).

Let’s look at the example I gave my friend: the coin-changing problem, with a twist to make the code interesting. Assume you have pennies, nickels, dimes, and quarters, and you want to change some arbitrary amount of money into these coins using as few coins as possible. We want a function named change() that is called like change(100, [1,5,10,25]) and returns a tuple like (4, {25: 4}), meaning that we need 4 coins, and the dictionary shows how the amount is changed (here, all four are quarters).
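The original implementation is not reproduced here; a plain recursive sketch matching that signature (no memoization yet) could look like this:

```python
def change(amount, coins):
    # Returns (number of coins, {denomination: count}),
    # or (inf, {}) when the amount cannot be changed.
    if amount == 0:
        return 0, {}
    if amount < 0 or not coins:
        return float("inf"), {}
    # Option 1: never use the largest denomination
    skip_count, skip_used = change(amount, coins[:-1])
    # Option 2: use one coin of the largest denomination
    use_count, use_used = change(amount - coins[-1], coins)
    if use_count + 1 < skip_count:
        used = dict(use_used)
        used[coins[-1]] = used.get(coins[-1], 0) + 1
        return use_count + 1, used
    return skip_count, skip_used

print(change(100, [1, 5, 10, 25]))  # (4, {25: 4})
```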

Below are some sample outputs for this solution:

While this solution works, it is painfully slow, because the code recalculates values like change(9, [1,5,10,25]) every time you change an amount larger than 9 (and likewise for every other sub-amount). When I timed 40,000 runs of this, it spent 20 minutes calculating all the values. With memoization added, it takes only 11 seconds.

Memoizing this function is slightly harder than memoizing a regular single-argument function like f(n), where you can simply use n as the hash-table key. Here we need to memoize values like change(9, [1,5,10]) and change(9, [1,5]) in different cells. There is a smart solution to this in the Python Cookbook: use the pickle module to serialize both 9 and the [1,5,10] list into one single string and use that as the dictionary key for the memoized result.
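A sketch of that trick as a reusable decorator, applied to the same recursive coin-changing logic (the decorator pickles the whole argument tuple into one hashable key):

```python
import pickle
from functools import wraps

def memoize(fn):
    cache = {}
    @wraps(fn)
    def wrapper(*args):
        key = pickle.dumps(args)   # serialize (amount, coins) into one key
        if key not in cache:
            cache[key] = fn(*args)
        return cache[key]
    return wrapper

@memoize
def change(amount, coins):
    # same recursive coin-changing logic as before, now memoized
    if amount == 0:
        return 0, {}
    if amount < 0 or not coins:
        return float("inf"), {}
    skip_count, skip_used = change(amount, coins[:-1])
    use_count, use_used = change(amount - coins[-1], coins)
    if use_count + 1 < skip_count:
        used = dict(use_used)
        used[coins[-1]] = used.get(coins[-1], 0) + 1
        return use_count + 1, used
    return skip_count, skip_used

print(change(100, [1, 5, 10, 25]))  # (4, {25: 4})
```

Note that pickling a list with the same contents always yields the same bytes here, so calls like change(9, [1, 5, 10]) hit the same cache cell every time.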

Running this code 40,000 times takes only 11 seconds, as opposed to 20 minutes without memoization. I am, however, wondering whether we can optimize the code even further.

A quick introduction to statistical power

A friend of mine asked me an interesting question today. He had one million data points in his experiment, yet the p-value of a t-test comparing the average metric value of the two populations was not smaller than 0.05. He was wondering if he could safely conclude that the two populations have the same average metric.

The truth is, a p-value that is not smaller than 0.05 can merely be a consequence of your experiment not having enough power, meaning the sample was not large enough to be able to reject the null hypothesis. To make this clearer, picture an extreme case in which instead of one million data points we have only 3: if the p-value is not smaller than 0.05, we certainly cannot claim the two populations have the same mean value.

Statisticians tend to look at the statistical power: the probability of rejecting the null hypothesis when it is in fact false, Power = P(Reject H0 | H0 is false). In other words, the power tells you how likely it is that your test is doing its job correctly.

Let’s demonstrate this with a quick R script. My friend told me the difference in the two means was just 0.10%, and since the p-value of the t-test on his 1 million data points was not significant, he suggested there was no real effect and any difference was just noise. Well, I constructed a little experiment with one million data points in which the true difference in means was 0.10%. I ran it 500 times and counted how many times I correctly rejected the null hypothesis. It turns out there is about a 90% chance the p-value comes out non-significant even though we already know there truly is a difference in the means.

Statistical Power

The command power.t.test() calculates the statistical power for my friend’s test, and it turns out to be slightly above 10%. We also estimate the power with the quick simulation in the next lines; after 500 iterations the estimate converges to about 0.118, while the theoretical value is 0.105. If he had 10 million data points, his test's power would increase from 10.5% to 61%, and with about 100 million points it would reach about 99.99%. Estimating the number of data points you need to get a statistically significant result (assuming a true effect exists) is usually referred to as power analysis and sample size determination.
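R's power.t.test() is not reproduced here, but we can approximate the same numbers with a stdlib-only Python sketch, assuming (consistently with the figures above) an absolute mean difference of 0.001 and unit standard deviation in each group:

```python
from statistics import NormalDist

def two_sample_power(diff, sd, n, alpha=0.05):
    # Normal approximation to the power of a two-sided two-sample t-test
    # with n data points per group
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    nc = diff / (sd * (2 / n) ** 0.5)   # noncentrality of the test statistic
    return z.cdf(nc - z_crit) + z.cdf(-nc - z_crit)

for n in (1_000_000, 10_000_000, 100_000_000):
    print(n, round(two_sample_power(0.001, 1.0, n), 4))
```

The three printed values land near 10%, 61%, and 99.99%, matching the numbers quoted above (up to the normal approximation).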

So next time your friend tells you their p-value was not significant, you can tell them to increase their sample size so the test has enough power. Any other conclusion might be reading too much into the data.

Expectation of a function of a random variable

If X is a random variable with f(x) as its probability density function, then what is the expectation of g(X)?

We know that

E[X] = \int_{-\infty}^{\infty} x f(x) dx

it is easy to show that

E[g(X)] = \int_{-\infty}^{\infty} g(x).f(x) dx

For example, the expected value of Y = \sin(X) when X \sim U(0,1) would be

E[\sin(X)] = \int_{0}^{1} \sin(x) dx = -\cos(1)+\cos(0) = 1 - \cos(1)

Now the question is what is the

E(e^X) if X \sim N(0,1)

This is in fact the mean of the log-normal distribution, and if you do the integration for the general case X \sim N(\mu, \sigma^2) you will arrive at

E(e^X) = e^{ \mu + \frac{1}{2}\sigma^2 }
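For the standard normal this gives e^{1/2} ≈ 1.6487, which a quick Monte Carlo check confirms:

```python
import math
import random

random.seed(2)
n = 200_000
# sample mean of e^X for X ~ N(0, 1)
estimate = sum(math.exp(random.gauss(0, 1)) for _ in range(n)) / n
print(estimate, math.exp(0.5))  # the two numbers should be close
```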

Python Tips: Adding two dictionaries

I have two dictionaries with numerical values. How can I add these two dictionaries such that the resulting dictionary has the union of their keys, and each value is the sum of the corresponding values in my original dictionaries?

In other words we are looking for a way to take x = {'a':1, 'b': 2} and y = {'b':10, 'c': 11, 'a':3, 'd':7} and output {'a': 4, 'c': 11, 'b': 12, 'd': 7} .

The following snippet does exactly that by using a couple of little Python tricks, like dictionary comprehensions.
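Here is the snippet, using the example dictionaries from above:

```python
x = {'a': 1, 'b': 2}
y = {'b': 10, 'c': 11, 'a': 3, 'd': 7}

# one value per key in the union, missing keys counted as 0
z = {a: x.get(a, 0) + y.get(a, 0) for a in set(x) | set(y)}
print(z)  # {'a': 4, 'b': 12, 'c': 11, 'd': 7} (key order may vary)
```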

Let’s analyze this line {a: x.get(a,0)+y.get(a,0) for a in set(x) | set(y)}

  • set(x) | set(y) produces a set that is the union of the keys of x and y
  • Hence for a in set(x) | set(y) iterates over exactly the keys of the output dictionary: {'a', 'b', 'c', 'd'}
  • x.get(a,0) works just like x[a] with one difference: if a is not among the keys of x, it returns 0 instead of raising a KeyError
  • Finally, a: x.get(a,0)+y.get(a,0) creates a dictionary entry with key a and value x.get(a,0)+y.get(a,0)

C# Tips: Converting a DateTime from one TimeZone to another TimeZone

How can I convert a date and time from PST to EST?

Below is the easiest way to convert a DateTime value from Pacific Time to Eastern Time.

The important thing is to specify a Kind for your original DateTime value; otherwise ConvertTime() will raise an exception that reads:

The conversion could not be completed because the supplied DateTime did not have the Kind property set correctly. For example, when the Kind property is DateTimeKind.Local, the source time zone must be TimeZoneInfo.Local.
Parameter name: sourceTimeZone