wilcox.test() on many combinations of columns

I have a data.table with approximately 400 columns and 800,000 rows. The columns represent samples and the rows represent CpG sites. Example data here:

require(data.table)
samples <- replicate(200,replicate(1000,runif(1)))
cpgs <- paste0('cpg',1:1000)
n <- c('cpg',paste0('sample',1:200))
data <- data.table(cbind(cpgs,samples))
colnames(data) <- n
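One thing worth noting about this setup: cbind() coerces its mixed character/numeric inputs to a common type, so every sample column in data ends up stored as character. A small self-contained illustration of the coercion:

```r
# cbind() promotes mixed inputs to the most general type, here character
m <- cbind(c("a", "b"), matrix(runif(4), nrow = 2))
class(m[1, 2])
# "character": the runif() values were silently converted to strings
```

This is why the numeric values need converting back before any statistical test, as the answer below does.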

I want to run a wilcox.test() on randomly selected columns of this data 1000 times. I've currently implemented this as follows, but it's very slow for large numbers of permutations.

cases <- paste0('sample',1:10)
controls <- paste0('sample',30:40)
data[, wilcox_p := wilcox.test(as.numeric(.SD[, mget(cases)]),
                               as.numeric(.SD[, mget(controls)]))$p.value,
     by = cpg]

Is there a more efficient way to do this? My complete use case, where getCpGSites() is the function described above, is here:

library(foreach)  # %dopar% also assumes a parallel backend has been registered
# num_cases and gene are defined elsewhere in the full script

iterations_vec <- 1:1000
labels <- paste0('sample',1:200)

permutations <- foreach(i = iterations_vec, .combine = 'rbind', .multicombine = TRUE) %dopar% {

  case_labels <- sample(labels, num_cases, replace = FALSE)
  control_labels <- labels[!labels %in% case_labels]

  signature_cpgs <- getCpGSites(case_labels, control_labels)
  num_signature_cpgs <- length(signature_cpgs)

  out <- data.table('gene' = gene,
                    'iteration' = i,
                    'num_signature_cpgs' = num_signature_cpgs)
  return(out)
}


Solution 1:

Here's one approach, based on the tidyverse. First, convert the sample columns from character to numeric once, rather than delegating the conversion to your function.

library(tidyverse)

numericData <- data %>% mutate(across(!cpg, as.numeric))  # leave the cpg labels as character
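If you would rather stay in data.table for this step, the same one-off conversion can be done by reference with := and .SDcols. A minimal sketch on a toy table (the toy values and dimensions are invented for the illustration):

```r
library(data.table)

# Toy stand-in for the real table: one label column plus
# two character-encoded sample columns
dt <- data.table(cpg     = c("cpg1", "cpg2"),
                 sample1 = c("0.1", "0.9"),
                 sample2 = c("0.4", "0.2"))

# Convert every column except the cpg labels to numeric, by reference
num_cols <- setdiff(names(dt), "cpg")
dt[, (num_cols) := lapply(.SD, as.numeric), .SDcols = num_cols]
```

Because := modifies the table in place, this avoids copying the full 800,000-row table during conversion.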

Now write a function that performs a Wilcoxon test on a randomly selected pair of sample columns:

randomWilcox <- function(d) {
  # Choose two distinct sample columns at random (column 1 holds the cpg labels)
  cols <- sample(2:ncol(d), size = 2, replace = FALSE)
  d1 <- d %>% select(cpg, all_of(cols))
  tibble(
    col1 = cols[1],
    col2 = cols[2],
    p.value = wilcox.test(d1 %>% pull(2), d1 %>% pull(3))$p.value
  )
}

Now use lapply to run the function 1000 times, with a very crude measure of speed:

startTime <- Sys.time()
lapply(1:1000, function(x) numericData %>% randomWilcox) %>% bind_rows()
endTime <- Sys.time()
# A tibble: 1,000 × 3
    col1  col2 p.value
   <int> <int>   <dbl>
 1    15   172  0.124 
 2    26    58  0.202 
 3   200    60  0.840 
 4   124    94  0.344 
 5   180   200  0.723 
 6   122   155  0.987 
 7   122   174  0.173 
 8    83   146  0.921 
 9   135    95  0.0605
10   168   174  0.0206
# … with 990 more rows

Each row of the output tibble contains the indices of the two columns selected and the p-value obtained from the corresponding wilcox.test().

The time taken is about 13 seconds on my machine. Is that quick enough?

endTime - startTime
Time difference of 13.1156 secs

Edit

Removing the intermediate data frame reduces the time taken to just over nine seconds:

randomWilcox <- function(d) {
  # As above, but pull the two columns directly rather than building
  # an intermediate data frame first
  cols <- sample(2:ncol(d), size = 2, replace = FALSE)
  tibble(
    col1 = cols[1],
    col2 = cols[2],
    p.value = wilcox.test(d %>% pull(cols[1]), d %>% pull(cols[2]))$p.value
  )
}
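The revised function is a drop-in replacement, so the same lapply/bind_rows call shown earlier still applies. Here is a compact, self-contained run on toy data (the toy table, its dimensions, and the 20 iterations are all invented for this sketch):

```r
library(dplyr)

# Toy stand-in for numericData: a cpg label column plus five numeric samples
set.seed(42)
toy <- data.frame(cpg = paste0("cpg", 1:10),
                  matrix(runif(50), nrow = 10))
names(toy)[-1] <- paste0("sample", 1:5)

randomWilcox <- function(d) {
  # Pick two distinct sample columns and test them directly
  cols <- sample(2:ncol(d), size = 2, replace = FALSE)
  tibble(
    col1 = cols[1],
    col2 = cols[2],
    p.value = wilcox.test(d %>% pull(cols[1]), d %>% pull(cols[2]))$p.value
  )
}

results <- bind_rows(lapply(1:20, function(i) randomWilcox(toy)))
```

Skipping the intermediate select() means no new three-column tibble is built on each of the 1000 calls, which is where the saving comes from.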

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Limey