If you use R for your daily work, you may sometimes need to initialize an empty data frame and then append data to it using rbind(). This is an extremely inefficient process, since R needs to reallocate memory every time you run something like a <- rbind(a, b). However, there are many situations where you cannot preallocate memory for the data frame up front and have to start with an empty one and grow it gradually.
One scenario where this method comes in handy is when you have a bunch of similar CSV files that contain more or less the same information, but each file is for a different date and the number of rows in each file might vary slightly. For example, assume you are storing the stock prices for different companies in files like “2012-3-4.csv” and “2012-3-5.csv”. Below are the contents of the two files. Note that the first file has 3 rows while the second one has 2. The number of rows in each file is not known to the programmer a priori, so it is not feasible to preallocate the memory for the data frame properly.
    $ cat 2012-3-4.csv
    MSFT 23.4
    AAPL 98.4
    IBM 102.2

    $ cat 2012-3-5.csv
    MSFT 23.5
    AAPL 98.3
You may want a data frame that contains the concatenation of the two files, with the date (taken from the file name) attached as the third column:

    MSFT 23.4 2012-3-4
    AAPL 98.4 2012-3-4
    IBM 102.2 2012-3-4
    MSFT 23.5 2012-3-5
    AAPL 98.3 2012-3-5
We can initialize an empty data frame with the proper data types to store all of the data:

    data <- data.frame(
      ticker = character(),
      value = numeric(),
      date = as.Date(character()),
      stringsAsFactors = FALSE
    )
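As a quick sanity check, the empty data frame can be inspected before anything is appended. The sketch below only uses the column names from the snippet above and confirms that the frame has no rows but already carries the intended column classes.

```r
# Build the empty data frame with typed, zero-length columns
data <- data.frame(
  ticker = character(),
  value = numeric(),
  date = as.Date(character()),
  stringsAsFactors = FALSE
)

# The frame has columns but no rows yet
print(nrow(data))

# Each column already has the class it was declared with
print(sapply(data, class))
```

Keep in mind that rbind() drops zero-row data frames before combining, so the column types of the final result come from the data that is appended, not from this empty frame.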
Now we can open each file, read its contents, and append them to our empty data frame:

    data <- data.frame(
      ticker = character(),
      value = numeric(),
      date = as.Date(character()),
      stringsAsFactors = FALSE
    )

    # Read each CSV file in the working directory and append it to the data frame
    for (file in dir(pattern = "\\.csv$")) {
      # read.table keeps "." as the decimal mark; read.csv2 expects ","
      stocks <- read.table(file, sep = " ", header = FALSE, stringsAsFactors = FALSE)
      # Strip the literal ".csv" extension to get the date
      stocks$date <- sub(".csv", "", file, fixed = TRUE)
      names(stocks) <- c("ticker", "value", "date")
      data <- rbind(data, stocks)
    }
The resulting data frame:

    > data
      ticker value     date
    1   MSFT  23.4 2012-3-4
    2   AAPL  98.4 2012-3-4
    3    IBM 102.2 2012-3-4
    4   MSFT  23.5 2012-3-5
    5   AAPL  98.3 2012-3-5
This way you can initialize an empty data frame, then loop through the files and append to it. This pattern, as mentioned before, is very inefficient and should be avoided for large amounts of data. But it is a handy little trick if performance is not an immediate concern.
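If performance does become a concern, one common base R alternative is to read every file into a list first and combine the pieces with a single rbind() call. The following is a minimal sketch of that idea, not a definitive implementation: the helper name read_stocks and the temporary directory are assumptions made for illustration, and the two files are recreated so the example runs on its own.

```r
# Recreate the two example files in a temporary directory (illustration only)
tmp <- tempfile("stocks_")
dir.create(tmp)
writeLines(c("MSFT 23.4", "AAPL 98.4", "IBM 102.2"), file.path(tmp, "2012-3-4.csv"))
writeLines(c("MSFT 23.5", "AAPL 98.3"), file.path(tmp, "2012-3-5.csv"))

# Hypothetical helper: read one file and attach the date taken from its name
read_stocks <- function(path) {
  stocks <- read.table(path, sep = " ", header = FALSE,
                       col.names = c("ticker", "value"),
                       stringsAsFactors = FALSE)
  stocks$date <- sub(".csv", "", basename(path), fixed = TRUE)
  stocks
}

# Read all files into a list, then bind them together once
files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
pieces <- lapply(files, read_stocks)
data <- do.call(rbind, pieces)

print(data)
```

Because the rows are combined in one step, the data frame is not copied again for every file, and no empty starting frame is needed.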
You should look at rbindlist from the package data.table. Blazing fast!
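For reference, here is a small sketch of what the rbindlist approach might look like. It assumes the data.table package is installed (the code skips itself otherwise) and uses hand-built data frames in place of the files, so the variable names pieces and combined are purely illustrative.

```r
# Only run if data.table is available; it is not part of base R
if (requireNamespace("data.table", quietly = TRUE)) {
  # Stand-ins for the per-file data frames (values copied from the example files)
  pieces <- list(
    data.frame(ticker = c("MSFT", "AAPL", "IBM"),
               value = c(23.4, 98.4, 102.2),
               date = "2012-3-4",
               stringsAsFactors = FALSE),
    data.frame(ticker = c("MSFT", "AAPL"),
               value = c(23.5, 98.3),
               date = "2012-3-5",
               stringsAsFactors = FALSE)
  )

  # rbindlist combines a list of data frames into one data.table in a single step
  combined <- data.table::rbindlist(pieces)
  print(combined)
}
```

The result is a data.table, which also inherits from data.frame, so most data frame code should continue to work on it.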