Four different methods for solving a very simple problem: Expected number of coin tosses to get one tail!

What is the expected number of times that you need to throw a biased coin to get the first tail?

This is an interesting question. Let’s assume the probability of getting a tail is p, how many times do we need to toss the coin to get the tail? Let’s solve it using a couple of different methods.

Solution 1: Using the definition of expectation

This is the hardest solution actually. Simply because with this method you will arrive at a geometric series that is not really easy to solve. Regardless let’s use the definition of expectation. By definition the expected value is as follows:

E[# tosses until a tail] = 1.(prob of getting a tail at one toss) + 2.(prob of getting a head, then a tail) + 3(prob of getting two heads then a tail) + … + n.(prob of getting n-1 heads then one tail)

This value can be written as

E[X] = 1.p + 2.(1-p).p + 3.(1-p)^2.p +\cdots+ np (1-p)^{n-1}

This can be simplified to the following

p \sum_{n=1}^\infty n(1-p)^{n-1}

The sum would be equal to \frac{1}{p^2} (you can either use Wolfram alpha or use this theorem)  therefore E[x] = 1/p

Solution 2: Using recurrence

There is a little trick that makes your life very easy. You might notice that the process at any point is independent of what has happened previously. Meaning that if you toss the coin and get a head then the expected number of tosses to get one tail is still E[X]. Therefore:

E[X] = 1.p + (1-p)(1 + E[X])

if we solve for E[X] we will have:

E[X] = \frac{1}{p}

This way you can avoid going through solving the series of \sum_{n=1}^\infty n(1-p)^{n-1} that we had to do in the previous solution.

Solution 3: Using geometric random variables 

This is probably the easiest solution. The number of tosses until one tail follows a geometric distribution and the mean for geometric distribution is 1/p. (Also see page 40 of Ross’s Introduction to Probability Models)

Solution 4: Using simulation 

This is easy to solve using simulation. The following is the R code for calculating this expected value.

And the output will be

If you prefer Python over R then the following is the most efficient implementation that I can come up with


Now can you solve this problem: “find the expected number of coin tosses to get two consecutive tails”


Top-down recursion, dynamic programming and memoization in Python

I was talking to a friend about dynamic programming and I realized his understanding of dynamic programming is basically converting a recursive function to an iterative function that calculates all the values up to the value that we are interested in. While this definition is not incorrect it only refers to a sub category of dynamic programs that is called the “bottom-up” solution to the dynamic programs.

Contrary to my friend’s belief, a dynamic programming solution can be recursive. However, in this approach you calculate previous values only when you need them, and only if you have not calculated these values already. This lazy solution of dynamic programs is called the “top-down” solutions. The top-down approach to dynamic programming is using a combination of recursive and memoization. In a generic recursive solution after you calculate the value of f(n-1) you probably throw it away. In the recursive solution, next time you need the f(n-1) value, you need to recalculate it.

The top-down dynamic programing approach is a combination of recursion and memoization. Memoization allows you to produce a look up table for f(x) values. As soon as you calculate f(n-1), you enter n-1 into a hash table (i.e., a Python dictionary) as the key and also enter f(n-1) as the value. This way, next time you need the f(n-1) you just look it up in your dictionary in O(1).

Let’s look at an example that I gave to my friend. Let’s solve the coins changing problem but just to make the code interesting we will solve a slightly harder problem. Here is the problem, assume you have pennies, nickels, dimes and quarters and you are interested in changing some arbitrary amount of money with these coins. We want to have a function named change() that is called like this change(100, [1,5,10,25]) and returns a tuple like (4, {25: 4}) meaning that I need 4 coins and the dictionary will be how the coins are changed (all four are quarters).

Below is some sample outputs for this solution

While this solution works, this is painstakingly slow. This is because the code will calculate the value of change(9, [1,5,10,25]) every time you are calculating the change that is larger than 9 (or any other amount). When I timed 40,000 runs of this, it ended up spending 20 minutes calculating all the values. If you add memoization to this, it will take only 11 seconds.

Memoizing this function is slightly more difficult than a regular function that is like f(n). There, you can just have a hash table with n as key. But in our function we need to memoize values like change(9,[1,5,10]) and change(9,[1,5]) in different cells. There is a smart solution in Python cookbook for this problem. You can use Pickle module to convert both 9 and the [1,5,10] array to one single string and store the output of this input in a dictionary.

Running this code 40,000 times only takes 11 seconds as opposed to 20 minutes without memoization. I am however wondering if we can optimize the code even further?

A quick introduction to statistical power

A friend of mine asked me an interesting question today. He said he has one million data points in his experiment yet the p value of a t-test on the average metric value on the two population was not smaller than 0.05, he was wondering if he could safely assume the two populations had the same average metric?

The truth is, the fact that the p value is not smaller than 0.05 can be merely a consequence of your experiment not having enough power. Meaning that there were not large enough samples to be able to reject the null hypothesis. To make this more clear we can picture an extreme case in which instead of one million users we have only 3 users. If p value is not smaller than 0.05 we certainly cannot claim the two population had the same mean value.

Statisticians tend to look at the statistical power. The statistical power is the probability of rejecting the null hypothesis when it is in fact false. Power = P(Reject H0| H0 is false). In other words the power for the test tells you how likely it is that your test is doing its job correctly.

Let’s demonstrate this with a quick R script. My friend told me that the difference in the two means is just 0.10% and since he had 1 million datapoints, he suggested that since the p value of the t-test was not significant then there is no real effect and any difference is due to just noise. Well, I constructed a little experiment, with one million data points, in which the true difference of means was .10%. I ran it 500 times and counted how many times I have correctly rejected the null hypothesis. Turns out that there is about 90% chance that the p value turns out to be non significant even though we already know that there truly is a difference in the mean.

Statistical Power

The command power.t.test() calculates the statistical power for my friend’s test and the power turns out to be slightly above 10%. We calculate the power with the quick simulation in the next lines and after 500 iterations the power value converges to about 0.118. The theoretical value turns out to be 0.105. If he had 10 million data points then his test power would increase from 10.5% to 61% and finally with about 100 million points the power is at about 99.99%. Estimating the number of data points that you need to get a statistically significant result (assuming a true effect exist) is usually referred to as power analysis and sample size determination.

So next time your friend tells you the p value was not significant you can just tell them to increase their sample size to have enough power. Any other conclusion might be reading too much into the data.

Expectation of a function of a random variable

If X is a random variable distributed according to the f(x) as its probability density function then what is the expectation of g(x)?

We know that

E[X] = \int_{-\infty}^{\infty} f(x) dx

it is easy to show that

E[g(X)] = \int_{-\infty}^{\infty} g(x).f(x) dx

For example the expected value of Y= sin(X) if X~U(0,1) would be

E[sin(X)] = \int_{0}^{1} sin(x) dx = -cos(1)+cos(0)

Now the question is what is the

E(e^X) if X \sim N(0,1)

This is in fact the expected value for the log normal distribution and if you do the integration you will arrive at

E(e^X) = e^{ \mu + \frac{1}{2}\sigma^2 }

Python Tips: Adding two dictionaries

I have two dictionaries with numerical values. How can I add these two dictionaries such that the resulting dictionary has the same keys but its values is the summation of the values in my original dictionaries?

In other words we are looking for a way to take x = {'a':1, 'b': 2} and y = {'b':10, 'c': 11, 'a':3, 'd':7} and output {'a': 4, 'c': 11, 'b': 12, 'd': 7} .

The following snippet does exactly that by using a couple of little Python tricks like list dictionary comprehensions.

Let’s analyze this line {a: x.get(a,0)+y.get(a,0) for a in set(x) | set(y)}

  • set(x) | set(y) produces a set whose keys are the union of the keys for x and y
  • Hence for a in set(x) | set(y) results in just the keys for the output set {'a', 'b', 'c', 'd'}
  • x.get(a,0) works just like x[a] with one difference. If a is not among the keys of x it will return 0 instead of throwing a key error
  • Finally a: x.get(a,0)+y.get(a,0) generates a dictionary with key a and the value a: x.get(a,0)+y.get(a,0)

C# Tips: Converting a DateTime from one TimeZone to another TimeZone

How can I convert a date and time from PST to EST?

Below is the easiest way to convert a DateTime value from a Pacific Time to Eastern Time.

The important thing is to specify a kind for your original DateTime for the original DateTime value otherwise the ConvertTime() will raise an exception that reads:

The conversion could not be completed because the supplied DateTime did not have the Kind property set correctly. For example, when the Kind property is DateTimeKind.Local, the source time zone must be TimeZoneInfo.Local.
Parameter name: sourceTimeZone

Knuth-Fisher-Yates Shuffling

Random numbers are hard to work with and they are harder to debug. In this post I will demonstrate the testing of a shuffle function in Python. We first implement an unbiased version of Knuth-Fisher-Yates shuffle and then implement two biased ones. I will then use the functional properties of python to use the same tester for all three .

What is the Knuth-Fisher-Yates shuffling? This shuffling is the correct method for shuffling a list or array. In this method you have a list and you want to shuffle it such that each element is equally likely to appear in any cell. Here is how it works, you have an iterator that goes through the elements of the list one by one, every time you pick an element you generate a random number that is between the position of that element and the position of the last element (inclusive), then you swap the two elements. It is easy to see that each element is equally likely to be appearing in any position in the final list. Below is the correct implementation of the Knuth shuffle in Python.

Allowing the element to be able to be swapped by itself is important. So is not allowing it to be swapped by anything before it. It is, in fact, relatively easy to come up with biased implementations of this function. I have implemented two methods that might look correct at the first look but they are biased and should be avoided. The first implementation takes the element but swaps it with any arbitrary element in the list. The second biased implementation does not allow the element to be swapped by itself.

How can we see this bias and how can we test these functions?
Testing functions that use random number generators are hard. One way of testing these three functions is to look at the probability of each element appearing as the first element of the shuffled list. Ideally we expect the elements to be uniformly distributed in the shuffled list. The tester uses the functional nature of python and here we pass a function to our functions to test.

Output of this shows that the probabilities of any element appearing in the first position is only close to each other in the correct implementation and any of the biased implementations shows a nonuniform distribution.

The complete code is below:

See the follow up post here

Python Tips: Fermat’s Little Theorem, Prime Numbers and Modular Exponentiation

This weekend I was reading about Fermat’s little theorem. It is a very smart way of probabilistically determining whether or not an integer number is prime. Based on Fermat’s little theorem a number n is prime if for any arbitrary positive integer a that is smaller than n the statement a^(n-1) modulo n is equal to 1. This is used in Fermat’s primality test which is a fast probabilistic test for primality. My python implementation for Fermat’s primality test is below but there is a trick in it. I will explain the optimization trick later.

Let’s write a bunch of tests and see the results

There is a very important line in this implementation and that is using this line

instead of this line

The first line is Python’s modular exponentiation notation and is significantly faster than the second line that calculates the pow(a,n-1) and then calculates the modulus. In fact if you use ((pow(a, n-1) % n) != 1) you would probably blow up your python interpreter as soon as it reaches the last line where we multiply a large prime number by itself.

Fermat proved his little theorem in 1640. This theory was used in cryptography in the twentieth century and I cannot help but wonder if he ever found any practical use for his theory in his lifetime.

Implementing a stack and a queue in Python

A friend of mine introduced me to the geeksforgeeks website. Today I was reading a couple of articles there and found an interesting brain teaser. By the way, if you have not seen that website you should definitely check it out. It contains interesting tips for implementing various algorithms.

The question is: “if you have an implementation of a stack how would you use it to implement a queue?”

This by itself is an interesting question, however let’s first implement a stack in Python. If you are familiar with python you might already know that one would usually use arrays when one needs a stack but implementing a stack in python is extremely easy. It is very similar to a linked list implementation. The only difference is the fact that push() writes to the head of the list and pop() takes the first element from the head. Below is my implementation of a stack in python

I am interested to see if someone can simplify this even further. Let’s test our code now.

A more complete implementation of a stack structure with peek() and size is as following. The peek() method returns the last item without removing it. And size() returns the size of the stack by going through all elements so it is O(n)

Now that we have a stack, how can we use this data structure to implement a queue? The trick is you need two stacks. Let’s call these stack1 and stack2. You would push your data in stack1 and pop from stack 2 (if stack 2 is not empty). If stack2 is empty pop values from stack1 and push them one by one into stack2 until stack1 is empty. This logic can be implemented in various ways, one way that I can implement this is the following:

Now that we have implemented our queue using our stack structure let’s test it.

This version of the queue is certainly not the best way to implement the queue (again you can build the queue on top of an array in Python) but it is certainly interesting.