Counting the number of lines in a data file

One of the most important tasks when working with a data file is to know how many lines the file contains. This matters because your data processing software (e.g., R, Excel, or Python) can sometimes fail to import all of the lines from the file, and you end up losing data without noticing it. I always check the number of rows imported into R against the number of lines in the file. There are various ways to do that.

First, let’s get a sample data file to work with.
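The examples below all use Galton.csv, a comma-separated copy of Galton’s parent/child height data. The URL here is only a placeholder; point wget at wherever you keep a copy of the file:

```bash
# Placeholder URL -- substitute the real location of the dataset
wget -O Galton.csv https://example.com/Galton.csv
```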

Below are a couple of different ways to print the number of lines of any text file. I have intentionally focused on shell commands and have not included code in any high-level language like Java or C# (except for R, for which I provide a little script for counting lines).

Solution 1: wc (word count)
wc is the easiest way to count the number of lines of a text file on Linux and on Windows (on Windows you need to install Cygwin).
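With our Galton.csv file:

```bash
wc -l Galton.csv
# 929 Galton.csv
```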

If you want to extract just the number, pipe the output into a cut command.
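On GNU systems the count is the first space-separated field of wc’s output:

```bash
wc -l Galton.csv | cut -d' ' -f1
# 929
```

Alternatively, wc -l < Galton.csv prints the bare number directly, since wc omits the file name when reading from stdin.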

Solution 2: sed
The following gives you the number of lines using sed
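Here ‘$’ addresses the last line, ‘=’ prints its line number, and -n suppresses the normal output:

```bash
sed -n '$=' Galton.csv
# 929
```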

Solution 3: R
If you are importing your data into R, it might be convenient to check the number of lines of the data file inside R and make sure it equals the number of rows that were imported. Keep in mind that if the file has a header, the number of rows is one less than the number of lines in the file.
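A minimal sketch: count the raw lines with readLines, which does no parsing and therefore cannot silently drop rows, then compare against what the import produced:

```r
n_lines <- length(readLines("Galton.csv"))
n_lines
# [1] 929

galton <- read.csv("Galton.csv")
nrow(galton) == n_lines - 1   # TRUE if no line was lost (one line is the header)
```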

Note that we are not counting the lines with the read.table function; read.table is exactly the step that can fail, so using it for the check would conceal any problem that happens during the import.

Solution 4: PowerShell (Windows only)
PowerShell’s object model provides an easy way to count the number of lines in our Galton file.
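Get-Content returns the file as an array of lines, so its Count property is the line count:

```powershell
(Get-Content .\Galton.csv).Count
# 929
```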

Note that if your file is really large, this method will fail: PowerShell cannot store the whole file in memory. For large files you need to use IO.StreamReader instead. The following script opens the file as a stream without keeping the whole file in memory.
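A sketch of that approach (the Resolve-Path is needed because .NET resolves relative paths against the process directory, not the current PowerShell location):

```powershell
$reader = New-Object System.IO.StreamReader((Resolve-Path .\Galton.csv).Path)
$count = 0
while ($null -ne $reader.ReadLine()) {
    $count++    # one line read per iteration; only one line in memory at a time
}
$reader.Close()
$count
```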

Solution 5: awk

You can also use awk to count the number of lines in a file.
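awk tracks the current record (line) number in NR, so printing it in the END block gives the total:

```bash
awk 'END { print NR }' Galton.csv
# 929
```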

Solution 6: grep
Using grep for such an easy task might be overkill, but here it is.
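The trick is that an empty pattern matches every line, and -c counts the matches:

```bash
grep -c "" Galton.csv
# 929
```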


Solution 7: perl
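Like awk, perl keeps the current line number in a special variable, $., so printing it in an END block yields the count:

```bash
perl -ne 'END { print "$.\n" }' Galton.csv
# 929
```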

There you go. These were a couple of different ways to make sure the data you are reading is intact. It does not really matter which method you use; what matters is that you always check that no line is missing.

Footnote: cat -n Galton.csv and nl Galton.csv can also tell you the number of lines, but they print every line along with its line number, which can be inconvenient for larger files.

Converting a csv file to a tsv file and vice versa

I used to use Excel to convert a tsv file to a csv file. However, there is a very quick way to convert csv and tsv files to each other using a sed one-liner. sed is a handy little tool that is available on both Linux and Windows (on Windows you need to install Cygwin). It is used for quick find-and-replace in files (though it can do a lot more, and I will cover some of its other uses later).

Now let’s download a csv file and convert it to tsv quickly. Let’s first use wget to download this file into our local directory.
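Any small csv file will do here; the URL and the file name below are placeholders:

```bash
wget -O data.csv https://example.com/data.csv
```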

Let’s see what our file looks like using the head command in the terminal.
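Using the placeholder file name from above:

```bash
head data.csv
```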

sed can plow through a file and change every instance of a particular string to another string. For example, if we want to convert all instances of “day” to “night”, we can use the following.
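```bash
# s/day/night/ substitutes; the trailing g applies it to every
# occurrence on a line, not just the first
sed 's/day/night/g' data.csv
```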

Now we can use this pattern to replace every instance of a comma ‘,’ with a tab ‘\t’.
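GNU sed understands \t as a tab; on BSD/macOS sed you would have to type a literal tab instead:

```bash
sed 's/,/\t/g' data.csv > data.tsv
```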

Note that you need single quotes around ‘s/,/\t/’, otherwise it will not work. To convert a tsv to a csv you can simply swap the comma and the \t and use the following.
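The same GNU-versus-BSD caveat about \t applies here:

```bash
sed 's/\t/,/g' data.tsv > data.csv
```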

An alternative method is to pipe the file through the tr command and replace the commas with tabs.
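tr translates characters one-for-one, which is all this conversion needs:

```bash
cat data.csv | tr ',' '\t' > data.tsv
```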

Now you can convert csv files to tsv files with a sed one-liner. However, these one-liners are not guaranteed to work on arbitrary csv files; quoted fields, for example, may themselves contain commas. For those cases you may need a proper csv parser, such as the one inside Excel, Python’s csv module, or R.

Other useful references:

  • This Stack Overflow question contains more information and more generalizable techniques for converting csv and tsv files to each other.
  • This post uses Python to do the same type of conversion.

R Tips: how to initialize an empty data frame

If you use R for your daily work, you might sometimes need to initialize an empty data frame and then append data to it using rbind(). This is an extremely inefficient process, since R needs to reallocate memory every time you run something like a <- rbind(a, b). However, there are many situations where you cannot preallocate memory for your initial data frame and need to start with an empty data frame and grow it gradually.

One particular scenario where this method comes in handy is when you have a bunch of similar csv files that contain more or less the same information, but each file covers a different date and the number of rows varies slightly from file to file. For example, assume you are storing the stock prices for different companies in files like “2012-3-4.csv” and “2012-3-5.csv”. Below is what the two files might look like; note that the first file has 3 rows while the second one has 2. The number of rows in each file is not known to the programmer a priori, so it is not feasible to preallocate the memory for the data frame properly.
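For concreteness, here is an illustrative stand-in for the two files (the tickers and prices are made up):

2012-3-4.csv:

```
company,price
IBM,180.2
GOOG,320.5
MSFT,28.1
```

2012-3-5.csv:

```
company,price
IBM,181.7
MSFT,27.9
```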


You may want to have a data frame that contains a concatenation of the two files with the date attached as the third column

We can initialize an empty data frame with proper data types to store all of the data
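A sketch, using the column layout of the illustrative files above plus the extra date column:

```r
# one column per field, with explicit types and zero rows to start with
prices <- data.frame(company = character(),
                     price   = numeric(),
                     date    = as.Date(character()),
                     stringsAsFactors = FALSE)
```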

Now we can open each file, read its contents, and append them to our empty data frame.
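Here the date is recovered from the file name itself:

```r
for (f in c("2012-3-4.csv", "2012-3-5.csv")) {
  day <- read.csv(f, stringsAsFactors = FALSE)
  day$date <- as.Date(sub("\\.csv$", "", f))  # strip ".csv", parse the rest
  prices <- rbind(prices, day)                # reallocates on every iteration
}
```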

Resulting data frame
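With the made-up numbers above, it looks like this:

```
  company price       date
1     IBM 180.2 2012-03-04
2    GOOG 320.5 2012-03-04
3    MSFT  28.1 2012-03-04
4     IBM 181.7 2012-03-05
5    MSFT  27.9 2012-03-05
```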

This way you can initialize an empty data frame, then loop through the files and append to it. This pattern, as mentioned before, is very inefficient and should be avoided for large amounts of data. But it is a handy little trick if performance is not of immediate concern.

Data Manipulation in Linux, Unix, Cygwin and PowerShell

Linux and Unix have excellent little tools for manipulating data files. If Windows is your favorite operating system, you can use Cygwin to run almost all of the Linux commands. Additionally, Windows has an excellent shell called PowerShell that allows you to do many of these operations. (While PowerShell is not covered in this blog post, I highly encourage you to check out the PowerShell ISE, an excellent IDE and interactive terminal for PowerShell. It comes preinstalled with Windows. If you are using the home edition of Windows 8 and cannot run it from the start screen, press Windows+R to open the Run window and type “powershell_ise”; it should launch the IDE, and you can then pin it to your start menu.)

wget
Let’s first download Francis Galton’s dataset. In the 19th century, Francis Galton, determined to understand the relationship between the heights of parents and their children, collected a lot of data on different people and their children. While doing so, he ended up inventing what is now known as linear regression. In this post we will look into that dataset.
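As elsewhere in this post, the URL is a placeholder for wherever you keep a copy of the dataset:

```bash
wget -O Galton.csv https://example.com/Galton.csv
```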

This command downloads the dataset and saves it as a csv (comma-separated values) file. Let’s print out the content of the file.
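```bash
cat Galton.csv
```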

head
If you type the above command in your terminal you might see pages and pages of data. One quick way to examine the first few lines of the file is the head command.
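```bash
head Galton.csv
```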

head also takes a number of lines as an argument. For example, if you are interested in seeing the top 20 lines of the file, you can use the following command.
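```bash
head -n 20 Galton.csv
```

(Older scripts often write this as head -20, which most implementations still accept.)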

Alternatively, you can use head with pipes and pipe a file into it.
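```bash
cat Galton.csv | head
```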

tail
tail works similarly to head, except that it shows the end of the file.
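```bash
tail Galton.csv
```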

Number of lines
Let’s see how many lines of data, including the header, exist in the file.
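```bash
wc -l Galton.csv
# 929 Galton.csv
```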

There are 929 lines in the file, including the header.

Plotting the data

We can plot the data with gnuplot without leaving the terminal. The “set term dumb” line renders the plot as ASCII art in the terminal; if you leave it out, gnuplot draws the graph in its own window.
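A sketch; the column order (child height first, parent height second) is an assumption about the file layout:

```bash
gnuplot << 'EOF'
set term dumb
set datafile separator ","
# every ::1 skips the header row
plot "Galton.csv" every ::1 using 2:1 title "child vs parent height"
EOF
```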

[Figure: Galton plot in gnuplot]

It is amazing how much you can do with the Linux shell. The interesting thing is that you can run all of these commands on Windows using Cygwin as well.

Converting csv to tsv
The sed command allows you to convert a csv file to tsv.
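As before, \t is understood by GNU sed (BSD sed needs a literal tab):

```bash
sed 's/,/\t/g' Galton.csv > Galton.tsv
```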

Extracting the second column of the data
You can use ‘cut’ to extract columns of data. Columns are defined by a delimiter; for example, a comma delimiter is specified with -d','. So to extract the second column you can use the following.
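```bash
cut -d',' -f2 Galton.csv
```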

If the delimiter is a tab it is slightly harder: you can type an actual tab character inside the quotes by pressing Ctrl+V followed by Tab (that is, type cut -d', press Ctrl+V, press Tab, then close the quote).
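```bash
# the quotes below contain a real tab character, typed as Ctrl+V then Tab
cut -d'	' -f2 Galton.tsv
```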

or you can rely on the fact that tab is cut’s default delimiter and drop the -d option altogether; adding -s additionally suppresses any line that does not contain the delimiter at all.
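```bash
# tab is the default delimiter; -s drops lines with no tab in them
cut -s -f2 Galton.tsv
```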

Children that are 70 inches tall

The grep command allows us to find lines that match a given regular expression. We can use it to find the lines that start with 70, which picks out all the children who are 70 inches tall.
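```bash
# ^ anchors the pattern to the start of the line, i.e. the first column,
# which holds the child's height
grep '^70' Galton.csv
```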

Taking the average value for each column
The Linux shell comes with a little language called awk. It allows you to perform calculations on your data.
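For example, summing each column and dividing by the number of lines seen:

```bash
awk -F',' '{ t1 += $1; t2 += $2; count++ } END { print t1/count, t2/count }' Galton.csv
```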

If you compare these values with what you get from Excel’s AVERAGE() function, you might notice that they differ slightly. This is because our little script counted the header row in the count variable, while the header contributed nothing to the totals. The following corrects this error. Note that the (count-1) here has nothing to do with the Bessel correction term in the sample variance formula.
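```bash
# divide by count-1 so the header row is not counted
awk -F',' '{ t1 += $1; t2 += $2; count++ } END { print t1/(count-1), t2/(count-1) }' Galton.csv
```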

There is a shorter way to do this, since awk keeps the current line number in NR (this was suggested by the first commenter, thanks!).
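```bash
awk -F',' '{ t1 += $1; t2 += $2 } END { print t1/(NR-1), t2/(NR-1) }' Galton.csv
```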

Think about how you could implement the variance using awk; you may want to look into Welford’s online algorithm for it.
Next: sort, uniq, shuffle, less, sample, cut