R: Using strsplit and grep on vector elements of a large dataset takes too long

(My first Stack Overflow question)

My goal is to improve an ETL process that identifies which NetApp file shares are related to which AD permission distribution groups. Currently, an application named 'TreeSize' scans a number of volumes and outputs several large .CSV files (35 MB+). I want to merge this data and remove every permission entry whose group (or named user) does not start with a capital G or D ('^[GD]'). With over 700,000 rows to process, it currently takes over 24 hours to run. I hope there is a more efficient way to process this data that drastically cuts that time down.

Here is test data that resembles the actual data once all files have been merged. Use rownum to adjust the size of the data (the real data has 700,000+ rows).

Test Data

set.seed(42)
rownum <- 2000  # real data has over 700,000 rows
i <- 1
datalist <- list()

while (i <= rownum) {

  randomStr1 <- paste(sample(c(0:9, letters, LETTERS[4:7], "-"), 10, replace = TRUE), collapse = "")
  randomStr2 <- paste(sample(c(0:9, letters, LETTERS[4:7], " "), 10, replace = TRUE), collapse = "")
  randomStr3 <- paste(sample(c(0:9, letters, LETTERS[4:7], " & "), 10, replace = TRUE), collapse = "")
  randomStr4 <- sample(c("full", "+r+w+x", "+r+x"), 3)

  datalist$volume[i] <- sample(LETTERS[1:6])[1]  # pick one random volume letter
  datalist$permissions[i] <- paste(c(randomStr1, randomStr2, randomStr3), randomStr4, sep = ": ", collapse = " | ")

  i <- i + 1
}
dat <- data.frame(datalist)
View(dat)

I have created a while loop that cycles through my merged data. I first use strsplit to create a vector containing each element between the pipes (" | "). Then I pass the vector to grep, searching for the regular expression '^[GD]'. Matching elements are kept, and if more than one is found they are collapsed back together with a semicolon and a space ("; ").

Here is my current way of doing this.

i <- 1
while (i <= length(dat$permissions)) {
  df <- strsplit(dat$permissions, " \\| |: ")[[i]]  # note: this re-splits the ENTIRE column on every iteration, then keeps only element i
  dat$permissions[i] <- paste(df[grep('^[GD]', df)], collapse = "; ")  # keep only elements starting with G or D, then paste back together
  print(paste(i, " of ", length(dat$permissions), " ", dat$permissions[i]))
  i <- i + 1
}
View(dat)

Once complete, I export the result to a single .CSV file to finish my transformation.

What would be a better way to process this data to drastically reduce the time it takes to process?



Solution 1

I think the key to speeding this up is to avoid looping over each row, since both the strsplit and the final paste can be done as single vectorised operations.

paste(
  seq_along(dat$permissions), "of", nrow(dat), 
  lapply(strsplit(dat$permissions, " \\| |: "), 
         \(x) paste(x[grepl("^[GD]", x)], collapse="; "))
)
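Note that \(x) is the shorthand anonymous-function syntax introduced in R 4.1. On older R versions, the same expression can be written with function(x):

```r
# Equivalent form for R < 4.1, using the long-form anonymous function
paste(
  seq_along(dat$permissions), "of", nrow(dat),
  lapply(strsplit(dat$permissions, " \\| |: "),
         function(x) paste(x[grepl("^[GD]", x)], collapse = "; "))
)
```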

Should result in a ~250 times speedup:

system.time({
paste(
  seq_along(dat$permissions), "of", nrow(dat), 
  lapply(strsplit(dat$permissions, " \\| |: "), \(x) paste(x[grepl("^[GD]", x)], collapse="; "))
)
})
##   user  system elapsed 
##   0.03    0.00    0.03 

system.time({
  i <- 1
  while (i <= length(dat$permissions)) {
    df <- strsplit(dat$permissions, " \\| |: ")[[i]]
    dat$permissions[i] <- paste(df[grep('^[GD]', df)], collapse = "; ")
    print(paste(i, " of ", length(dat$permissions), " ", dat$permissions[i]))
    i <- i + 1
  }
})
##   user  system elapsed 
##   8.11    0.06    8.17

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 thelatemail