Sparklyr performance in R compared to other on-disk solutions like SAS: removing duplicates with distinct takes hours in sparklyr, seconds in SAS

I was hoping to get some clarification on optimizing sparklyr performance in R on my local machine.

I have imported a CSV file with 211 million rows (the CSV is 17 gigabytes, so it won't fit in memory) and just a few columns, and I would like to select only the distinct values of one of those columns. To accomplish this I imported the data as "test" using spark_read_csv with memory = FALSE and a schema generated from the data, saved separately in its own object (the import took a few minutes).

After importing with that function I ran very basic code to deduplicate one column.

It has been running for 2 hours, so I decided to try SAS instead, where I was able to accomplish what I needed in a few minutes.

This seems very problematic to me; even on a local machine this does not seem like a very difficult problem.

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local", version = "2.3")

download <- function(datapath) {
  # Infer column classes from the first 1,000 rows using base R
  spec_with_r <- sapply(read.csv(datapath, nrows = 1000), class)
  #spec_explicit <- c(x = "character", y = "numeric")
  # memory = FALSE maps the file without caching it into memory
  system.time(dataname <- spark_read_csv(
    sc,
    path = datapath,
    columns = spec_with_r,
    memory = FALSE
  ))
  return(dataname)
}

test <- download("./data/metastases17.csv")

test2 <- test %>% select(DX) %>% distinct()
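For context, a sparklyr pipeline like the one above is lazy: `distinct()` only builds a Spark SQL query, and the full table scan happens once the result is materialized. A minimal sketch of forcing execution, assuming the same `sc` connection, the `test` table returned by `download()`, and the `DX` column:

```r
library(sparklyr)
library(dplyr)

# distinct() is translated to Spark SQL; nothing runs until materialized
distinct_dx <- test %>%
  select(DX) %>%
  distinct()

# Force execution on the Spark side by counting the distinct values...
sdf_nrow(distinct_dx)

# ...or pull the (now small) deduplicated result into an R data frame
dx_values <- collect(distinct_dx)
```

Timing either `sdf_nrow()` or `collect()` is what actually measures the deduplication, rather than the query construction.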


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow