'wilcox.test() on many combinations of columns
I have a data.table with approximately 400 columns and 800,000 rows. The columns represent samples and the rows represent CpG sites. Example data here:
require(data.table)
samples <- replicate(200,replicate(1000,runif(1)))
cpgs <- paste0('cpg',1:1000)
n <- c('cpg',paste0('sample',1:200))
data <- data.table(cbind(cpgs,samples))
colnames(data) <- n
I want to run a wilcox.test() on randomly selected columns of this data 1000 times. I've currently implemented this the following way, but it's very slow on large numbers of permutations.
cases <- paste0('sample',1:10)
controls <- paste0('sample',30:40)
data[,wilcox_p:=wilcox.test( as.numeric(.SD[,mget(cases)]), as.numeric(.SD[,mget(controls)]) )$p.value,by=cpg]
Is there a more efficient way to do this? My complete use case, where getCpGSites() is the function described above, is here:
iterations_vec <- 1:1000
labels <- paste0('sample',1:200)
permutations <- foreach(i = iterations_vec, .combine='rbind', .multicombine = TRUE ) %dopar% {
case_labels <- sample(labels,num_cases,replace=FALSE)
control_labels <- labels[!labels %in% case_labels]
signature_cpgs <- getCpGSites(case_labels,control_labels)
num_signature_cpgs <- length(signature_cpgs)
out <- data.table('gene' = gene,
'iteration' = i,
'num_signature_cpgs' = num_signature_cpgs)
return(out)
}
Solution 1:[1]
Here's one approach, based on the tidyverse. First, convert all your character data tonumeric, rtaher than delegating to your function.
library(tidyverse)
numericData <- data %>% mutate(across(where(is.character), as.numeric))
Now write a function to perform a Wilcoxon test on a randomly selected pair of columns
randomWilcox <- function(d) {
cols <- sample(2:ncol(d), size=2, replace=FALSE)
d1 <- d %>% select(cpg, all_of(cols))
tibble(
col1=cols[1],
col2=cols[2],
p.value=wilcox.test(d1 %>% pull(2), d1 %>% pull(3))$p.value
)
}
Now use lapply to run the function 1000 times, with a very crude measure of speed:
startTime <- Sys.time()
lapply(1:1000, function(x) numericData %>% randomWilcox) %>% bind_rows()
endTime <- Sys.time()
# A tibble: 1,000 × 3
col1 col2 p.value
<int> <int> <dbl>
1 15 172 0.124
2 26 58 0.202
3 200 60 0.840
4 124 94 0.344
5 180 200 0.723
6 122 155 0.987
7 122 174 0.173
8 83 146 0.921
9 135 95 0.0605
10 168 174 0.0206
# … with 990 more rows
Each row of the output tibble contains the indices of the columns selected, and the p-value obtained from corresponding wilcox.test.
The time taken is about 13 seconds on my machine. Is that quick enough?
endTime - startTime
Time difference of 13.1156 secs
Edit
Removing the intermediate data frame reduces the time taken to just over none seconds:
randomWilcox <- function(d) {
cols <- sample(2:ncol(d), size=2, replace=FALSE)
tibble(
col1=cols[1],
col2=cols[2],
p.value=wilcox.test(d %>% pull(cols[1]), d %>% pull(cols[2]))$p.value
)
}
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Limey |
