base R faster than readr for reading multiple CSV files

There is a lot of documentation on how to read multiple CSVs and bind them into one data frame. I have 5000+ CSV files I need to read in and bind into one data structure.

In particular, I've followed the discussion here: Issue in Loading multiple .csv files into single dataframe in R using rbind

The weird thing is that base R is much faster than any other solution I've tried.

Here's what my CSV looks like:

> head(PT)
  Line          Timestamp       Lane.01 Lane.02 Lane.03 Lane.04 Lane.05 Lane.06 Lane.07 Lane.08
1    PL1    05-Jan-16 07:17:36      NA      NA      NA      NA      NA      NA      NA      NA
2    PL1    05-Jan-16 07:22:38      NA      NA      NA      NA      NA      NA      NA      NA
3    PL1    05-Jan-16 07:27:41      NA      NA      NA      NA      NA      NA      NA      NA
4    PL1    05-Jan-16 07:32:43    9.98   10.36   10.41   10.16   10.10    9.97   10.07    9.59
5    PL1    05-Jan-16 07:37:45    9.65    8.87    9.88    9.86    8.85    8.75    9.19    8.51
6    PL1    05-Jan-16 07:42:47    9.14    8.98    9.29    9.04    9.01    9.06    9.12    9.08

I've created three methods for reading in and binding the data. The files are located in a separate directory which I define as:

dataPath <- "data"
PTfiles <- list.files(path=dataPath, full.names = TRUE)

Method 1: Base R

classes <- c("factor", "character", rep("numeric",8))

# build function to load data
load_data <- function(files, classes) { 
   tables <- lapply(files, read.csv, colClasses=classes, na.strings=c("NA", ""))
   do.call(rbind, tables)
}

#clock
method1 <- system.time(
   PT <- load_data(PTfiles, classes)
)

Method 2: read_csv

In this case I created a wrapper function for read_csv to use:

#create wrapper function for read_csv
read_csv.wrap <- function(x) { read_csv(x, skip = 1, na=c("NA", ""),
                      col_names = c("tool", "timestamp", paste("lane", 1:8, sep="")),
                      col_types = 
                         cols(
                            tool = col_character(),
                            timestamp = col_character(),
                            lane1 = col_double(),
                            lane2 = col_double(),
                            lane3 = col_double(),
                            lane4 = col_double(),
                            lane5 = col_double(),
                            lane6 = col_double(),
                            lane7 = col_double(),
                            lane8 = col_double()
                           )
                     )
}

##
# Same as method 1, just uses read_csv instead of read.csv

load_data2 <- function(files) { 
   tables <- lapply(files, read_csv.wrap)
   do.call(rbind, tables)
}

#clock
method2 <- system.time(
   PT2 <- load_data2(PTfiles)
)

Method 3: read_csv + dplyr::bind_rows

load_data3 <- function(files) { 
   tables <- lapply(files, read_csv.wrap)
   dplyr::bind_rows(tables)
}

#clock
method3 <- system.time(
   PT3 <- load_data3(PTfiles)
)

What I can't figure out is why the read_csv and dplyr methods are slower in elapsed time when they should be faster. The CPU time is lower, but why would the elapsed time (file system I/O) increase? What's going on here?

Edit - I added the data.table method as suggested in the comments

Method 4: data.table

library(data.table)

load_data4 <- function(files){
   tables <- lapply(files, fread)
   rbindlist(tables)
}

method4 <- system.time(
   PT4 <- load_data4(PTfiles)
)

The data.table method is the fastest from a CPU standpoint. But the question still stands: what is going on with the read_csv methods that makes them so slow?

> rbind(method1, method2, method3, method4)
        user.self sys.self elapsed
method1      0.56     0.39    1.35
method2      0.42     1.98   13.96
method3      0.36     2.25   14.69
method4      0.34     0.67    1.74
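
As a side note, not one of the timed methods above: newer readr releases (2.0 and later) accept a vector of file paths in a single call, so the per-file lapply can be dropped entirely. A minimal sketch, assuming readr >= 2.0 and reusing the column layout from the wrapper above:

library(readr)

# one call over all files; skip, col_names and col_types apply to each file,
# and id records which source file each row came from
PT_all <- read_csv(PTfiles, skip = 1, na = c("NA", ""),
                   col_names = c("tool", "timestamp", paste("lane", 1:8, sep = "")),
                   col_types = cols(tool = col_character(),
                                    timestamp = col_character(),
                                    .default = col_double()),
                   id = "source_file")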


Solution 1:[1]

I would do this in the terminal (Unix). Put all the files in the same folder, navigate to that folder in the terminal, then use the following command to create a single CSV file:

cat *.csv > merged_csv_file.csv

One caveat with this method is that the header of each file will show up in the middle of the observations. To solve this I would suggest the following:

Get just the header from the first file

head -2 file1.csv > merged_csv_file.csv

Then skip the first "X" lines from the other files with the following command, where "X" is the number of lines to skip:

tail -n +3 -q file*.csv >> merged_csv_file.csv

-n +3 makes tail print each file from its 3rd line to the end, -q tells it not to print a header with the file name (see man tail), and >> appends to the file rather than overwriting it as > would. (For files with a single header line, as in the question's data, use head -1 and tail -n +2 instead.)
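
A minimal sketch of reading the merged file back into R, assuming the merged_csv_file.csv produced above and the ten-column layout from the question (fread as in Method 4):

library(data.table)

# read the single merged file; column classes are stated explicitly
# to match the question's data instead of relying on type guessing
PT_merged <- fread("merged_csv_file.csv",
                   colClasses = list(character = 1:2, numeric = 3:10),
                   na.strings = c("NA", ""))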

Solution 2:[2]

I might have found a related issue. I am reading in nested CSV data from some simulation output, where multiple columns have CSV-formatted data as elements, which I need to unnest and reshape for analysis.

For simulations with many runs, this resulted in thousands of elements that needed to be parsed. Using map(., read_csv) the transformation would take hours. When I rewrote my script to apply read.csv in a lambda function, the operation completed in seconds.

I'm curious if there is some intermediate system I/O operation or error handling that creates a bottleneck you wouldn't run into with a single input file.
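
For illustration, a minimal sketch of the swap described above; the nested_runs vector is hypothetical stand-in data, and the I() literal-data wrapper assumes readr >= 2.0:

library(purrr)
library(readr)

# hypothetical column of CSV-formatted strings, one element per simulation run
nested_runs <- c("a,b\n1,2\n3,4", "a,b\n5,6\n7,8")

# readr variant reported to be slow: one read_csv call per element
parsed_readr <- map(nested_runs, ~ read_csv(I(.x), show_col_types = FALSE))

# base R variant reported to be fast: read.csv applied in a lambda
parsed_base <- map(nested_runs, ~ read.csv(text = .x))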

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Marcio Rodrigues
Solution 2 jlamb