Sparklyr performance in R compared to other on-disk solutions like SAS: removing duplicates with distinct takes hours in Sparklyr, seconds in SAS
I was hoping to receive some clarification on optimizing Sparklyr performance in R on my local machine.
I have imported a CSV file with 211 million rows and just a few columns (the CSV is 17 gigabytes, so it won't fit in memory), and I would like to select only the distinct values of one of the columns. To accomplish this, I imported the data as "test" using spark_read_csv with memory = FALSE and a schema generated from the data and saved separately in its own object (the import took a few minutes).
After importing with this function, I ran very basic code to deduplicate one column.
It has been running for 2 hours, so I decided to try SAS instead, where I was able to accomplish what I needed in a few minutes.
This seems very problematic to me: even on a local machine, this does not seem like a very difficult problem.
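library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local", version = "2.3")

# Read a CSV into Spark, using column classes inferred from the first 1000 rows.
download <- function(datapath, dataname) {
  spec_with_r <- sapply(read.csv(datapath, nrows = 1000), class)
  # spec_explicit <- c(x = "character", y = "numeric")
  system.time(dataname <- spark_read_csv(
    sc,
    path = datapath,
    columns = spec_with_r,
    memory = FALSE  # note lowercase: R argument names are case-sensitive
  ))
  return(dataname)
}

test <- download("./data/metastases17.csv", test)

# Keep only the distinct values of the DX column.
test2 <- test %>% select(DX) %>% distinct()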
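To make the deduplication step reproducible without the 17 GB file, here is a minimal sketch of the same select/distinct pipeline on a small made-up table. The data frame `toy` and the names `toy_tbl`, `dx_distinct` and `dx_local` are assumptions for illustration only and are not part of my real data. It also assumes a working local Spark installation. As far as I understand, the dplyr pipeline on a Spark table is lazy, so `show_query()` only prints the generated SQL and `collect()` is the call that actually runs the job.

```r
library(sparklyr)
library(dplyr)

# Assumes a local Spark installation, as in the question.
sc <- spark_connect(master = "local")

# Hypothetical toy data standing in for the real CSV.
toy <- data.frame(DX = c("a", "b", "a", "c", "b"), stringsAsFactors = FALSE)
toy_tbl <- copy_to(sc, toy, name = "toy", overwrite = TRUE)

# Lazy: this only builds a query, nothing is executed yet.
dx_distinct <- toy_tbl %>% select(DX) %>% distinct()

# Print the SQL that sparklyr would send to Spark.
show_query(dx_distinct)

# Run the job and bring the distinct values back as a local data frame.
dx_local <- collect(dx_distinct)

spark_disconnect(sc)
```

Timing the `collect()` call separately from the import may help show which of the two steps is actually slow.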
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
