Counting the number of lines in a data file

One of the most important tasks when working with a data file is knowing how many lines it contains. This matters because data processing software (e.g., R, Excel, or Python) sometimes fails to import every line from the file, so you lose data without noticing. I always check the number of rows imported into R against the number of lines in the file. There are various ways to do that.

First let’s read in a sample data file

Below are several ways to print the number of lines in any text file. I have intentionally focused on shell commands and have not included code for high-level languages like Java or C# (except for R, for which I provide a short line-counting script).

Solution 1: wc (word count)
wc is the easiest way to count the lines of a text file on Linux and on Windows (on Windows you need to install Cygwin).
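As a minimal sketch, the snippet below builds a small throwaway file (a hypothetical stand-in for your own data file) and counts its lines with wc -l. Keep in mind that wc -l counts newline characters, so a final line without a trailing newline is not counted.

```shell
# Create a tiny 3-line sample file (hypothetical; replace with your own file).
sample=$(mktemp)
printf 'x,y\n1,2\n3,4\n' > "$sample"

# Prints the line count followed by the file name.
wc -l "$sample"

# Reading from standard input makes wc print the count only.
n_lines=$(wc -l < "$sample")
echo "$n_lines"
```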

If you want only the number of lines, without the file name, pipe the output into a cut command.
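One way this pipe might look is sketched here, assuming GNU wc, which prints the count first with no leading padding. Some BSD-style wc implementations (macOS, for example) pad the count with spaces, in which case the first space-delimited field is empty and reading from standard input is the safer route. The sample file is a hypothetical stand-in.

```shell
# Create a tiny 3-line sample file (hypothetical; replace with your own file).
sample=$(mktemp)
printf 'x,y\n1,2\n3,4\n' > "$sample"

# wc prints "<count> <file name>"; cut keeps the first space-delimited field.
count=$(wc -l "$sample" | cut -d' ' -f1)
echo "$count"
```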

Solution 2: sed
The following gives you the number of lines using sed
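A minimal sketch of a sed one-liner for this, run against a hypothetical sample file: -n suppresses the normal output, and the $= command prints the line number of the last line.

```shell
# Create a tiny 3-line sample file (hypothetical; replace with your own file).
sample=$(mktemp)
printf 'x,y\n1,2\n3,4\n' > "$sample"

# -n: do not echo the input; '$=': print the line number at the last line.
count=$(sed -n '$=' "$sample")
echo "$count"
```

For an empty file this prints nothing rather than 0.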

Solution 3: R
If you are importing your data into R, it may be convenient to count the lines of the data file inside R and confirm that the count equals the number of rows imported. Keep in mind that if the file has a header, the number of rows is one less than the number of lines in the file.

Note that we are not using the read.table function in R for this. Using read.table would conceal any problem that occurs during the data import.
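The header adjustment can also be worked out from the shell before you open R. This is only a sketch: the sample file and its column names are made up, and expected_rows is simply the number you would then compare against the row count R reports after import.

```shell
# Create a sample file with one header line and two data lines (made-up values).
sample=$(mktemp)
printf 'x,y\n1,2\n3,4\n' > "$sample"

# Total number of lines, header included.
n_lines=$(wc -l < "$sample")

# With a header, the data rows are one fewer than the lines.
expected_rows=$((n_lines - 1))
echo "lines: $n_lines, expected data rows: $expected_rows"
```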

Solution 4: PowerShell (Windows only)
PowerShell’s object model allows for an easy way to count the number of lines in our Galton file.

Note that if your file is really large, this method will fail because PowerShell cannot hold the whole file in memory. For large files, use IO.StreamReader in PowerShell. The following script opens the file as a stream without keeping the whole file in memory.

Solution 5: awk

You can also use awk to count the number of lines in a file.
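A minimal awk sketch, again against a hypothetical sample file: NR holds the number of records read so far, so printing it in the END block gives the total line count.

```shell
# Create a tiny 3-line sample file (hypothetical; replace with your own file).
sample=$(mktemp)
printf 'x,y\n1,2\n3,4\n' > "$sample"

# NR is the running record (line) number; in END it equals the total.
count=$(awk 'END { print NR }' "$sample")
echo "$count"
```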

Solution 6: grep
Using grep for such an easy task might be overkill, but here it is.
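One possible grep invocation is sketched here on a hypothetical sample file. The -c option counts matching lines, and the empty pattern matches every line, including blank ones.

```shell
# Create a tiny 3-line sample file (hypothetical; replace with your own file).
sample=$(mktemp)
printf 'x,y\n1,2\n3,4\n' > "$sample"

# -c: print a count of matching lines; '' matches every line.
count=$(grep -c '' "$sample")
echo "$count"
```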

 

Solution 7: perl

There you go. These were several ways to make sure the data you are reading is intact. It does not really matter which method you use; what matters is that you always check that no line is missing.

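Footnotes: cat -n Galton.csv and nl Galton.csv can also tell you the number of lines, but they output each line with its line number, which can be inconvenient for larger files. Also note that nl does not number blank lines by default, so use nl -ba if the file may contain empty lines.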
