'base R faster than readr for reading multiple CSV files
There is a lot of documentation on how to read multiple CSVs and bind them into one data frame. I have 5000+ CSV files I need to read in and bind into one data structure.
In particular I've followed the discussion here: Issue in Loading multiple .csv files into single dataframe in R using rbind
The weird thing is that base R is much faster than any other solution I've tried.
Here's what my CSV looks like:
> head(PT)
Line Timestamp Lane.01 Lane.02 Lane.03 Lane.04 Lane.05 Lane.06 Lane.07 Lane.08
1 PL1 05-Jan-16 07:17:36 NA NA NA NA NA NA NA NA
2 PL1 05-Jan-16 07:22:38 NA NA NA NA NA NA NA NA
3 PL1 05-Jan-16 07:27:41 NA NA NA NA NA NA NA NA
4 PL1 05-Jan-16 07:32:43 9.98 10.36 10.41 10.16 10.10 9.97 10.07 9.59
5 PL1 05-Jan-16 07:37:45 9.65 8.87 9.88 9.86 8.85 8.75 9.19 8.51
6 PL1 05-Jan-16 07:42:47 9.14 8.98 9.29 9.04 9.01 9.06 9.12 9.08
I've created three methods for reading in and binding the data. The files are located in a separate directory which I define as:
dataPath <- "data"
PTfiles <- list.files(path=dataPath, full.names = TRUE)
Method 1: Base R
classes <- c("factor", "character", rep("numeric",8))
# build function to load data
load_data <- function(dataPath, classes) {
tables <- lapply(PTfiles, read.csv, colClasses=classes, na.strings=c("NA", ""))
do.call(rbind, tables)
}
#clock
method1 <- system.time(
PT <- load_data(path, classes)
)
Method 2: read_csv
In this case I created a wrapper function for read_csv to use
#create wrapper function for read_csv
read_csv.wrap <- function(x) { read_csv(x, skip = 1, na=c("NA", ""),
col_names = c("tool", "timestamp", paste("lane", 1:8, sep="")),
col_types =
cols(
tool = col_character(),
timestamp = col_character(),
lane1 = col_double(),
lane2 = col_double(),
lane3 = col_double(),
lane4 = col_double(),
lane5 = col_double(),
lane6 = col_double(),
lane7 = col_double(),
lane8 = col_double()
)
)
}
##
# Same as method 1, just uses read_csv instead of read.csv
load_data2 <- function(dataPath) {
tables <- lapply(PTfiles, read_csv.wrap)
do.call(rbind, tables)
}
#clock
method2 <- system.time(
PT2 <- load_data2(path)
)
Method 3: read_csv + dplyr::bind_rows
load_data3 <- function(dataPath) {
tables <- lapply(PTfiles, read_csv.wrap)
dplyr::bind_rows(tables)
}
#clock
method3 <- system.time(
PT3 <- load_data3(path)
)
What I can't figure out, is why read_csv and dplyr methods are slower for elapsed time when they should be faster. The CPU time is decreased, but why would the elapsed time (file system) increase? What's going on here?
Edit - I added the data.table method as suggested in the comments
Method 4 data.table
library(data.table)
load_data4 <- function(dataPath){
tables <- lapply(PTfiles, fread)
rbindlist(tables)
}
method4 <- system.time(
PT4 <- load_data4(path)
)
The data.table method is the fastest from a CPU standpoint. But the question still stands on what is going on with the read_csv methods that makes them so slow.
> rbind(method1, method2, method3, method4)
user.self sys.self elapsed
method1 0.56 0.39 1.35
method2 0.42 1.98 13.96
method3 0.36 2.25 14.69
method4 0.34 0.67 1.74
Solution 1:[1]
I would do that in the terminal(Unix). I would put all files int the same folder and then navigate to that folder (in terminal), the use the following command to create only one CSV file:
cat *.csv > merged_csv_file.csv
One observation regarding this method is that the header of each file will show up in the middle of the observations. To solve this I would suggest you do:
Get just the header from the first file
head -2 file1.csv > merged_csv_file.csv
then skip the first "X" lines from the other files, with the folling command, where "X" is the number of lines to skip.
tail -n +3 -q file*.csv >> merged_csv_file.csv
-n +3 makes tail print lines from 3rd to the end, -q tells it not to print the header with the file name (read man), >> adds to the file, not overwrites it as >.
Solution 2:[2]
I might have found a related issue. I am reading in nested CSV data from some simulation output, where multiple columns have CSV formatted data as elements, which I need to unnest and reshape for analysis.
With simulations where I have many runs, this resulted in thousands of elements that needed to be parsed. Using map(.,read_csv) this would take hours to transform. When I rewrote my script to apply read.csv in a lambda function, the operation would complete in seconds.
I'm curious if there is some intermediate system I/O operation or error handling that creates a bottleneck you wouldn't run into with a single input file.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Marcio Rodrigues |
| Solution 2 | jlamb |
