Data Manipulation in Linux, Unix, Cygwin and Powershell

Linux and Unix have excellent little tools for manipulating data files. If Windows is your favorite operating system you can use Cygwin to use almost all of Linux commands. Additionally Windows has an excellent shell system called PowerShell that allows you to do many of these operations. (While Powershell is not covered in this blog post I highly encourage you to check out the Powershell ISE that is an excellent IDE and interactive terminal for Powershell. It is already installed in your Windows but you can also install it from here. Remember that if you are using the home edition of Windows 8 you cannot run it from the start screen, press “Windows+R” to open the Run window and then type “Powershell ISE” it should lunch the IDE and you can pin it to your start menu.)

wget
Let’s first download Francis Galton’s dataset. In the 19th century Francis Galton, determined to understand the relationship between the height of the parents and children collected a lot of data from different people and their children. While doing that he ended up inventing what is now known as the linear regression. In this post we will look into that dataset

this command downloads the dataset and saves it as a csv (comma separated values) file. Let’s print out the content of the file.

head
if you type the above command in your terminal you might see pages and pages of data. One quick command to examine the first couple of lines of the file is the head command

head takes the number of lines as well. For example if you are interested in seeing the top 20 lines of the file you can use the following command

Alternatively you can use head with pipes and pipe a file to it

tail
tail works similar to head except it shows the end of the file

Number of lines
Let’s see how many lines of data, including the header, exists in the file

There are 929 lines of data in the file.

Plotting the data

The “set term dumb” will set the output to terminal, if you leave that out it will plot the graph in a gnuwindow

Galton plot in GNUplot

It is amazing how much you can do with Linux shell. The interesting thing is that you can run all of these commands inside Windows using Cygwin as well

Converting csv to tsv
the sed command allows you to convert a csv file to tsv

Extracting the second column of the data
You can use ‘cut’ to extract columns of data. Columns are defined by a delimiter. For example a comma delimiter is defined by -d’,’. So for extracting the second column you can use the following

If the delimiter is a tab then it is slightly harder, you can either use the actual tab by using control+V as the following (for typing the cut -d’ ‘ type -d’ and then press control+V and then press tab)

or you can use -s in the cut command instead of -d

Children that are 70 inches tall

The grep command allows us to find lines that contain some certain regular expression patterns. We can use it to find the lines that start with 70. This will find all the children who are 70 inches tall

Taking the average value for each column
Linux shell comes with a little language called awk. It allows you to perform calculations on your data

If you compare these values with what you might get from Excel AVG() command you might notice that the two values differ from each other, this is because out little script has counted the header and added it into the count variable while it did not contribute anything to the total variable. The following commands will correct this error. Note that the appearance of (count-1) has nothing to do with the Bessel correction term on average and variance formula.

There is a shorter way to do the following since awk keeps the number of lines in NR (this was suggested by the first commenter, thanks!)

Think about how you can implement the variance function using AWK, you may want to look into the Welford’s online formula for it.
Next: sort, uniq, shuffle, less, sample, cut