Knuth-Fisher-Yates Shuffling

Random numbers are hard to work with, and code that uses them is even harder to debug. In this post I will demonstrate how to test a shuffle function in Python. We first implement an unbiased version of the Knuth-Fisher-Yates shuffle and then implement two biased ones. I will then use Python’s support for passing functions around to run the same tester against all three.

What is Knuth-Fisher-Yates shuffling? It is the standard method for shuffling a list or array: you have a list and you want to shuffle it such that each element is equally likely to end up in any cell. Here is how it works. An index walks through the elements of the list one by one; at each position you generate a random index between that position and the position of the last element (inclusive), then you swap the two elements. It is easy to see that each element is equally likely to appear in any position of the final list. Below is a correct implementation of the Knuth shuffle in Python.
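
A minimal sketch of the algorithm just described (the function name is my own):

    import random

    def knuth_shuffle(items):
        """Unbiased Knuth-Fisher-Yates shuffle, in place."""
        n = len(items)
        for i in range(n - 1):
            # pick j uniformly from i..n-1 inclusive; j == i is allowed
            j = random.randint(i, n - 1)
            items[i], items[j] = items[j], items[i]
        return items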

Allowing an element to be swapped with itself is important, and so is not allowing it to be swapped with anything before it. It is, in fact, relatively easy to come up with biased implementations of this function. I have implemented two variants that might look correct at first glance but are biased and should be avoided. The first takes each element and swaps it with an arbitrary element anywhere in the list. The second does not allow an element to be swapped with itself.
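
Sketches of the two biased variants (again, the names are my own):

    def biased_shuffle_anywhere(items):
        """Biased: swaps each element with ANY position, including earlier ones."""
        n = len(items)
        for i in range(n):
            j = random.randint(0, n - 1)  # bias: j may point before i
            items[i], items[j] = items[j], items[i]
        return items

    def biased_shuffle_no_self(items):
        """Biased: never lets the element at position i stay in place."""
        n = len(items)
        for i in range(n - 1):
            j = random.randint(i + 1, n - 1)  # bias: j == i is excluded
            items[i], items[j] = items[j], items[i]
        return items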

How can we see this bias and how can we test these functions?
Testing functions that use random number generators is hard. One way of testing these three functions is to look at the probability of each element appearing as the first element of the shuffled list. Ideally we expect the elements to be uniformly distributed over the positions of the shuffled list. The tester exploits the functional nature of Python: we pass each shuffle function as an argument to the tester.
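
A tester along these lines estimates, for each element, how often it lands in position 0:

    from collections import Counter

    def test_shuffle(shuffle_fn, n=4, trials=100_000):
        counts = Counter()
        for _ in range(trials):
            counts[shuffle_fn(list(range(n)))[0]] += 1
        for element in range(n):
            print(element, counts[element] / trials)

    for fn in (knuth_shuffle, biased_shuffle_anywhere, biased_shuffle_no_self):
        print(fn.__name__)
        test_shuffle(fn)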

The output shows that the probabilities of the elements appearing in the first position are close to each other only in the correct implementation; both biased implementations show a nonuniform distribution.

The complete code is simply the three shuffle implementations above combined with the tester.

See the follow-up post here.

Python Tips: Fermat’s Little Theorem, Prime Numbers and Modular Exponentiation

This weekend I was reading about Fermat’s little theorem. It leads to a very smart way of probabilistically determining whether or not an integer is prime. Fermat’s little theorem says that if a number n is prime, then for any positive integer a smaller than n, a^(n-1) modulo n is equal to 1. This is the basis of Fermat’s primality test, a fast probabilistic test: if the equality fails for some a, then n is certainly composite; if it holds for many random choices of a, then n is probably prime. My Python implementation of Fermat’s primality test is below, but there is a trick in it. I will explain the optimization trick later.
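
A sketch of such a test (the function name and the number of rounds k are my own choices):

    import random

    def fermat_is_prime(n, k=20):
        """Probabilistic primality test based on Fermat's little theorem."""
        if n < 4:
            return n in (2, 3)
        for _ in range(k):
            a = random.randint(2, n - 2)
            # three-argument pow() is the optimization trick explained below
            if pow(a, n - 1, n) != 1:
                return False
        return True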

Let’s write a bunch of tests and see the results.
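
For example:

    for n in (5, 12, 17, 221, 7919, 104729):
        print(n, fermat_is_prime(n))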

There is a very important line in this implementation, and it is this line:
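
    if pow(a, n - 1, n) != 1: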

instead of this line:
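
    if (pow(a, n - 1) % n) != 1: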

The first line uses Python’s three-argument pow(), which performs modular exponentiation and is significantly faster than the second line, which computes pow(a, n-1) in full and only then takes the modulus. In fact, if you use ((pow(a, n-1) % n) != 1) you will probably hang your Python interpreter as soon as it reaches the test with a large prime, since pow(a, n-1) produces an astronomically large integer before the reduction.

Fermat stated his little theorem in 1640; the first published proof came from Euler almost a century later. The theorem found use in cryptography in the twentieth century, and I cannot help but wonder whether Fermat ever found any practical use for it in his lifetime.

Implementing a stack and a queue in Python

A friend of mine introduced me to the geeksforgeeks website. Today I was reading a couple of articles there and found an interesting brain teaser. By the way, if you have not seen that website, you should definitely check it out; it contains interesting tips for implementing various algorithms.

The question is: “if you have an implementation of a stack how would you use it to implement a queue?”

This by itself is an interesting question; however, let’s first implement a stack in Python. If you are familiar with Python you might already know that one would usually use a plain list when one needs a stack, but implementing a stack from scratch in Python is extremely easy. It is very similar to a linked list implementation; the only difference is that push() writes to the head of the list and pop() takes the first element from the head. Below is my implementation of a stack in Python.
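
A minimal sketch (the class names are my own):

    class Node:
        def __init__(self, value):
            self.value = value
            self.next = None

    class Stack:
        def __init__(self):
            self.head = None

        def push(self, value):
            node = Node(value)
            node.next = self.head  # the new node becomes the head
            self.head = node

        def pop(self):
            if self.head is None:
                raise IndexError("pop from empty stack")
            node = self.head
            self.head = node.next
            return node.value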

I am interested to see if someone can simplify this even further. Let’s test our code now.
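
    s = Stack()
    for x in (1, 2, 3):
        s.push(x)
    print(s.pop(), s.pop(), s.pop())  # prints: 3 2 1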

A more complete implementation of a stack structure, with peek() and size(), is as follows. The peek() method returns the most recently pushed item without removing it, and size() returns the size of the stack by walking through all the elements, so it is O(n).
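
A sketch extending the class above:

    class Stack:
        def __init__(self):
            self.head = None

        def is_empty(self):
            return self.head is None

        def push(self, value):
            node = Node(value)
            node.next = self.head
            self.head = node

        def pop(self):
            if self.head is None:
                raise IndexError("pop from empty stack")
            node = self.head
            self.head = node.next
            return node.value

        def peek(self):
            if self.head is None:
                raise IndexError("peek at empty stack")
            return self.head.value

        def size(self):
            # walks the whole list, hence O(n)
            count, node = 0, self.head
            while node is not None:
                count += 1
                node = node.next
            return count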

Now that we have a stack, how can we use this data structure to implement a queue? The trick is that you need two stacks; call them stack1 and stack2. You push incoming data onto stack1 and pop from stack2 (if stack2 is not empty). If stack2 is empty, pop values from stack1 and push them one by one onto stack2 until stack1 is empty. This logic can be implemented in various ways; one way to implement it is the following:
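
A sketch built on the Stack class above (method names are my own):

    class Queue:
        def __init__(self):
            self.stack1 = Stack()  # incoming elements
            self.stack2 = Stack()  # outgoing elements

        def enqueue(self, value):
            self.stack1.push(value)

        def dequeue(self):
            if self.stack2.is_empty():
                # refill stack2 by reversing stack1
                while not self.stack1.is_empty():
                    self.stack2.push(self.stack1.pop())
            return self.stack2.pop()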

Now that we have implemented our queue using our stack structure let’s test it.
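
    q = Queue()
    for x in (1, 2, 3):
        q.enqueue(x)
    print(q.dequeue(), q.dequeue())  # prints: 1 2
    q.enqueue(4)
    print(q.dequeue(), q.dequeue())  # prints: 3 4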

This version of the queue is certainly not the best way to implement a queue (again, in Python you can build a queue on top of a plain list, or use collections.deque), but it is certainly interesting.

Counting the number of lines in a data file

One of the most important tasks when working with a data file is to know how many lines the file contains. This matters because sometimes your data processing software (e.g., R, Excel or Python) fails to properly import all of the lines from the file, and you lose data without noticing it. I always check the number of rows imported into R against the number of lines in the file. There are various ways to do that.

First let’s download a sample data file. I will use the Galton dataset that also appears later in this blog; any text file will do.
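
The URL below is illustrative; substitute wherever you keep the file:

    wget http://www.math.uah.edu/stat/data/Galton.csv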

Below are a couple of different ways to print the number of lines of any text file. I have intentionally focused on shell commands and have not included code in any high-level language like Java or C# (except for R; I provide a little R script for counting lines).

Solution 1: wc (word count)
wc is the easiest way to count the number of lines of a text file in Linux and Windows (you need to install Cygwin on Windows):
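
    wc -l Galton.csv
    # prints: 929 Galton.csv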

In case you want to extract just the number of lines, pipe the output into a cut command:
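
    wc -l Galton.csv | cut -d' ' -f1
    # alternatively, wc -l < Galton.csv prints the bare number directly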

Solution 2: sed
The following gives you the number of lines using sed
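
    sed -n '$=' Galton.csv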

Solution 3: R
If you are importing your data into R, then it might be convenient to check the number of lines of the data file inside R and make sure it is equal to the number of rows that are imported. Keep in mind that if the file has a header then the number of rows is one less than the number of lines in the file.
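
One way to do this in R:

    length(readLines("Galton.csv"))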

Note that we are not using the read.table function in R for doing this. Using read.table would conceal any problem that happens during the data import.

Solution 4: Powershell (windows only)
PowerShell’s object model allows for an easy way to count the number of lines in our Galton file:
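
For example, Get-Content returns the file as an array of lines:

    (Get-Content .\Galton.csv).Count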

Note that if your file is really large, this method can fail because PowerShell stores the whole file in memory. For large files you need to use IO.StreamReader in PowerShell. The following script reads the file as a stream without keeping the whole file in memory.
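
A sketch (the path is illustrative):

    $reader = New-Object System.IO.StreamReader("C:\data\Galton.csv")
    $count = 0
    while ($reader.ReadLine() -ne $null) { $count++ }
    $reader.Close()
    $count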

Solution 5: awk

You can also use awk to count the number of lines in a file.
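
    awk 'END { print NR }' Galton.csv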

Solution 6: grep
Using grep for such an easy task might be overkill, but here it is:
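
    grep -c '' Galton.csv
    # counts the lines by matching the empty pattern on every line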


Solution 7: perl
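
One way to do it in perl:

    perl -ne 'END { print $. }' Galton.csv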

There you go. These were a couple of different ways to make sure the data you are reading is intact. It does not really matter which method you use; the only thing that matters is that you always check to make sure no line is missing.

Footnotes: cat -n Galton.csv and nl Galton.csv can also tell you the number of lines, but they print every line prefixed with its line number, which can be inconvenient for larger files.

Converting a csv file to a tsv file and vice versa

I used to use Excel to convert a tsv file to a csv file. However, there is a very quick way to convert csv and tsv files to each other by using a sed one-liner. sed is a handy little tool that is available on both Linux and Windows (on Windows you need to install Cygwin). It is used for quick find-and-replace in files (however, it can do a lot more, and I will cover some of its other uses later).

Now let’s download a csv file and convert it to tsv quickly. Let’s first use wget to download the Galton file into our local directory.
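
Again the URL is illustrative; substitute wherever the file is hosted:

    wget http://www.math.uah.edu/stat/data/Galton.csv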

Let’s see what our file looks like by using the head command in the terminal:
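
    head Galton.csv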

sed can plow through a file and change any instance of a particular string to another string. For example, if we want to convert all the instances of “day” to “night” we can use the following:
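
    sed 's/day/night/g' somefile.txt   # somefile.txt is a placeholder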

Now we can use this pattern to replace every instance of a comma ‘,’ with a tab ‘\t’:
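
    sed 's/,/\t/g' Galton.csv > Galton.tsv
    # GNU sed understands \t; in other sed implementations type a literal tab instead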

Note that you need single quotes around ‘s/,/\t/g’, otherwise it will not work. To convert a tsv to a csv you can simply reverse the order of the comma and \t and use the following:
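
    sed 's/\t/,/g' Galton.tsv > Galton.csv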

An alternative method is to pipe the file to the tr command and replace the commas with tabs:
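
    cat Galton.csv | tr ',' '\t' > Galton.tsv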

Now you can convert csv files to tsv files with a sed one-liner. However, these one-liners are not guaranteed to work on an arbitrary csv file; in particular, they break quoted fields that contain embedded commas. For those you might end up using the csv parser inside Excel, the csv module in Python, or R.

Other useful references:

  • This stack overflow question contains more information and more generalizable techniques to convert csv and tsv files to each other.
  • This post uses python to do the same type of conversion

R Tips: how to initialize an empty data frame

If you use R for all your daily work then you might sometimes need to initialize an empty data frame and then append data to it using rbind(). This is an inefficient process since R needs to reallocate memory every time you run something like a <- rbind(a, b). However, there are cases where you cannot preallocate memory for your initial data frame and need to start with an empty data frame and grow it gradually.

One particular example scenario where this method becomes handy is when you have a bunch of similar csv files that contain more or less the same information, but each file is for a different date and the number of rows in each file might vary slightly. For example, assume you are storing the stock prices of different companies in files like “2012-3-4.csv” and “2012-3-5.csv”. Below are the contents of the two files; note that the first file has 3 rows while the second one has 2. The number of rows in each file is not known to the programmer a priori, so it is not feasible to preallocate memory for the data frame properly.
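
Something like the following (the tickers and prices are made up for illustration):

    # 2012-3-4.csv
    symbol,price
    MSFT,32.50
    IBM,205.12
    GOOG,610.34

    # 2012-3-5.csv
    symbol,price
    MSFT,32.71
    IBM,206.00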


You may want to end up with a data frame that contains a concatenation of the two files, with the date attached as a third column.

We can initialize an empty data frame with the proper data types to store all of the data:
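
A sketch, assuming the illustrative columns above:

    prices <- data.frame(symbol = character(),
                         price  = numeric(),
                         date   = as.Date(character()),
                         stringsAsFactors = FALSE)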

Now we can open each file, read its contents, and append them to our empty data frame:
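
    for (f in c("2012-3-4.csv", "2012-3-5.csv")) {
      day <- read.csv(f, stringsAsFactors = FALSE)
      day$date <- as.Date(sub(".csv", "", f, fixed = TRUE))  # date taken from the file name
      prices <- rbind(prices, day)  # reallocates every time -- fine for small data
    }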

The resulting data frame:
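
With the illustrative inputs above:

      symbol  price       date
    1   MSFT  32.50 2012-03-04
    2    IBM 205.12 2012-03-04
    3   GOOG 610.34 2012-03-04
    4   MSFT  32.71 2012-03-05
    5    IBM 206.00 2012-03-05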

This way you can initialize an empty data frame, then loop through the files and append to it. This pattern, as mentioned before, is very inefficient and should be avoided for large amounts of data. But it is a handy little trick if performance is not an immediate concern.

Data Manipulation in Linux, Unix, Cygwin and Powershell

Linux and Unix have excellent little tools for manipulating data files. If Windows is your favorite operating system, you can use Cygwin to run almost all of the Linux commands. Additionally, Windows has an excellent shell called PowerShell that allows you to do many of these operations. (While PowerShell is not covered in this blog post, I highly encourage you to check out the PowerShell ISE, an excellent IDE and interactive terminal for PowerShell. It is already installed in your Windows, but you can also install it from here. Remember that if you are using the home edition of Windows 8 you cannot run it from the start screen; press “Windows+R” to open the Run window and then type “Powershell ISE”. It should launch the IDE, and you can then pin it to your start menu.)

wget
Let’s first download Francis Galton’s dataset. In the 19th century Francis Galton, determined to understand the relationship between the heights of parents and their children, collected a lot of data from different families. While doing that he ended up inventing what is now known as linear regression. In this post we will look into that dataset.
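
The URL below is illustrative; substitute wherever the dataset is hosted:

    wget http://www.math.uah.edu/stat/data/Galton.csv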

This command downloads the dataset and saves it as a csv (comma-separated values) file. Let’s print out the content of the file:
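
    cat Galton.csv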

head
If you type the above command in your terminal you might see pages and pages of data. One quick way to examine the first couple of lines of the file is the head command:
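
    head Galton.csv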

head takes a line count as an argument as well. For example, if you are interested in seeing the top 20 lines of the file you can use the following command:
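
    head -n 20 Galton.csv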

Alternatively you can use head with pipes and pipe a file to it
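
    cat Galton.csv | head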

tail
tail works similarly to head, except it shows the end of the file:
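
    tail Galton.csv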

Number of lines
Let’s see how many lines of data, including the header, exist in the file:
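
    wc -l Galton.csv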

There are 929 lines of data in the file.

Plotting the data
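
gnuplot can plot the file straight from the shell; a sketch (the column numbers depend on the file layout):

    gnuplot -e "set term dumb; set datafile separator ','; plot 'Galton.csv' using 1:2"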

The “set term dumb” line renders the plot as ASCII art in the terminal; if you leave it out, gnuplot will draw the graph in a separate window.

Galton plot in GNUplot

It is amazing how much you can do with the Linux shell. The interesting thing is that you can run all of these commands inside Windows using Cygwin as well.

Converting csv to tsv
The sed command allows you to convert a csv file to tsv:
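
    sed 's/,/\t/g' Galton.csv > Galton.tsv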

Extracting the second column of the data
You can use ‘cut’ to extract columns of data. Columns are defined by a delimiter; for example, a comma delimiter is specified with -d’,’. So to extract the second column you can use the following:
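
    cut -d',' -f2 Galton.csv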

If the delimiter is a tab then it is slightly harder. You can embed an actual tab character in the -d argument (type -d’ and then press Ctrl+V followed by Tab, then close the quote),
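
    cut -d"$(printf '\t')" -f2 Galton.tsv
    # $(printf '\t') produces the literal tab; typing Ctrl+V then Tab inside
    # quotes achieves the same thing interactively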

or, since tab is cut’s default delimiter, you can drop -d entirely and add -s, which skips any line that contains no delimiter:
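
    cut -f2 -s Galton.tsv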

Children that are 70 inches tall

The grep command allows us to find lines that match a regular expression. We can use it to find the lines that start with 70, which will find all the children who are 70 inches tall:
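
This assumes the child’s height is the first column:

    grep '^70' Galton.csv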

Taking the average value for each column
The Linux shell comes with a little language called awk that allows you to perform calculations on your data:
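
A naive first attempt, again assuming the values of interest are in the first column:

    awk -F',' '{ total += $1; count++ } END { print total/count }' Galton.csv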

If you compare this value with what you get from Excel’s AVERAGE() function you might notice that the two values differ. This is because our little script has counted the header line in the count variable while the header contributed nothing to the total variable. The following command corrects this error. Note that the appearance of (count-1) has nothing to do with the Bessel correction used in the sample variance formula.
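
    awk -F',' '{ total += $1; count++ } END { print total/(count-1) }' Galton.csv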

There is a shorter way to do this, since awk keeps the current line number in NR (this was suggested by the first commenter, thanks!):
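
    awk -F',' 'NR > 1 { total += $1 } END { print total/(NR-1) }' Galton.csv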

Think about how you can implement the variance function in awk; you may want to look into Welford’s online algorithm for it.
Next: sort, uniq, shuffle, less, sample, cut

The story of two lovers and the beauty of statistics

[The original title for this post was “The story of two lovers, joint densities and the beauty of statistics”. I dropped the “joint densities” from the title]

Recently a friend of mine gave me a puzzle. The puzzle is as follows:

Many years ago there were two people, secretly in love, traveling with a group of friends who did not know they were in love. The large group of friends was going around Europe by bus, stopping in major cities and staying in hotels. In the long hallway of each hotel, every person would get a separate room. Every night one of these lovers would sneak out of their room and go to the other’s room for an innocent cuddle. The puzzle is: if the long hallway is L meters long and rooms are uniformly spread along the hallway, what is the average distance that one of them needs to tiptoe to reach the other?

Depending on how you want to solve it, this can be an easy or a relatively hard problem. We will solve it in two ways: first with simulation, and then we will validate the simulation with an analytical solution.

Let’s first simplify the problem. Assume the location of one of them is X \sim U(0,L) and the other is located at Y \sim U(0,L). We want to find the average value of Z = |X-Y|.
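
A minimal simulation in Python:

    import random

    L = 1.0
    trials = 1_000_000
    z = [abs(random.uniform(0, L) - random.uniform(0, L)) for _ in range(trials)]
    print(sum(z) / trials)  # comes out close to L/3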

The histogram of the Z values is shown below; Z follows a triangular distribution.
Z values

If R is your favorite language then you might find the rgl and akima libraries pretty useful here. You can produce visualizations of the Z values (against X and Y) using R:
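
A sketch of that idea (assuming akima’s interp() to grid the samples and rgl’s persp3d() to draw the surface):

    library(akima)
    library(rgl)

    x <- runif(2000); y <- runif(2000); z <- abs(x - y)
    surf <- interp(x, y, z)                      # interpolate onto a regular grid
    persp3d(surf$x, surf$y, surf$z, col = "steelblue",
            xlab = "X", ylab = "Y", zlab = "Z")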

Z vs (X, Y)

This problem can be solved in 6 lines of Python or R (perhaps two lines if you don’t mind a compressed notation). But how can we find a closed-form solution?

Analytical solution
We have X \sim U(0,L) and Y \sim U(0,L), and we want to find the expected value E[Z] = E[|X-Y|]. Additionally we know that X and Y are independent, X \perp Y. The probability density functions of X and Y are f_X(x) = 1/L and f_Y(y) = 1/L. Since the two variables are independent, the expectation takes the form:

E[Z] = \int_0^L \int_0^L |x-y|f_X(x)f_Y(y)dy dx

By replacing f_X(x) = 1/L and f_Y(y) = 1/L we have:

E[Z] = \int_0^L \int_0^L |x-y| \frac{1}{L^2} dy dx

We can now remove the absolute value by conditioning on the sign of X-Y: if X-Y > 0 then |X-Y| = X-Y, otherwise |X-Y| = Y-X.

E[Z] = \frac{1}{L^2} \left[ \int_0^L \int_0^x (x-y) \, dy \, dx + \int_0^L \int_x^L (y-x) \, dy \, dx \right]

Therefore:

E[Z] = \frac{1}{L^2} \left[ \int_0^L \left( xy - \tfrac{1}{2}y^2 \right)\Big|_{y=0}^{y=x} dx + \int_0^L \left( \tfrac{1}{2}y^2 - xy \right)\Big|_{y=x}^{y=L} dx \right]
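
Evaluating the inner terms and integrating over x:

E[Z] = \frac{1}{L^2} \int_0^L \left( x^2 - Lx + \frac{L^2}{2} \right) dx = \frac{1}{L^2} \left( \frac{L^3}{3} - \frac{L^3}{2} + \frac{L^3}{2} \right)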

This is equal to \frac{1}{3} L. So the two lovers would, on average, walk one third of the hallway to see each other.

Also see:
Average Distance Between Random Points on a Line

Python Tips: Converting a string to a list of characters

Sometimes you may need to convert a string to a list of characters (for example, you may want to calculate a hash value of the string using the folding method). If you want to break the string on spaces, you can simply use mystring.split(). For splitting the string into individual characters, you can simply use list(mystring).
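
For example:

    mystring = "hello world"
    print(mystring.split())  # ['hello', 'world']
    print(list(mystring))    # ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd']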

However, in practice we almost never build our own hash function. Python has a built-in hash() function, and for cryptographic hashes you can use the hashlib module.

Interesting questions: Anagram Numbers

I found this old question in the archives of the British Informatics Olympiad.

Question 1: Anagram Numbers
An anagram number is a number that can be multiplied by at least one single digit number (other than 1) to become an anagram of itself. Any anagrams formed by multiplying an anagram number by a digit are said to be generated by that anagram number. Two numbers are anagrams of each other if they can both be formed by rearranging the same combination of digits.

For example:

  • 1246878 is an anagram number; multiplying by 6 generates 7481268 or by 7 generates 8728146. These numbers all contain a single 1, 2, 4, 6, 7 and two 8s.
  • 1246879 is not an anagram number.

1(a) [ 25 marks ]

Write a program which reads in a single number (between 1 and 123456789 inclusive) and determines if it is an anagram number. If the number is not an anagram number you should output the word NO. If it is an anagram number you should output each single digit it can be multiplied by to make an anagram of itself.

  • Sample run 1
    123456789
    2 4 5 7 8
  • Sample run 2
    100
    NO

1(b) [ 2 marks ]

85247910 is generated by which anagram numbers?
1(c) [ 3 marks ]
How many anagram numbers between 100,000 and 999,999 contain no duplicated digits?

Here is a quick solution to this problem. I am wondering if there is any solution that is more efficient.
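
A straightforward sketch (the function names are my own):

    def is_anagram(a, b):
        # two numbers are anagrams if their digits form the same multiset
        return sorted(str(a)) == sorted(str(b))

    def anagram_multipliers(n):
        return [d for d in range(2, 10) if is_anagram(n, n * d)]

    n = int(input())
    digits = anagram_multipliers(n)
    print(" ".join(map(str, digits)) if digits else "NO")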

Results: on the sample inputs, the code prints 2 4 5 7 8 for 123456789 and NO for 100, matching the expected outputs.