R: keep the row if all values in group A are larger than all values in group B
I’m new to loops in R.
I have a count table of mRNA transcripts for control and test samples, each in triplicate.
gene1 <- c(100, 200, 300, 400, 500, 600)
gene2 <- c(600, 500, 400, 300, 200, 100)
gene3 <- c(100, 200, 400, 300, 500, 600)
data <- rbind(gene1, gene2, gene3)
colnames(data) <- c("control1", "control2","control3","test1","test2","test3")
data <- as.data.frame(data)
I would like to go through it and keep the gene (row) if all 3 “control” samples have larger numbers than the 3 “test” samples.
Then a new list should be made with those rows.
(There are more than 10000 rows in the real data set.)
I've tried the code below, and also the all() function instead of min()/max(), but neither works.
control <- data[, c(1,2,3)]
test <- data[, c(4,5,6)]
for (i in 1:nrow(data)){
if(min(control)>max(test)){
list <- rbind(i, list)
}}
Thank you!
Solution 1:[1]
Here is a base R approach. For each row, it picks the smallest of the three control values (columns 1 to 3) and the largest of the three test values (columns 4 to 6), and keeps the row only if the former is greater than the latter.
data[sapply(1:nrow(data), function(x)
  data[x, which.min(data[x, 1:3])] > data[x, which.max(data[x, 4:6]) + 3]), ]
Output
control1 control2 control3 test1 test2 test3
gene2 600 500 400 300 200 100
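For comparison, here is a minimal sketch of how the loop from the question could be repaired. In the original loop, min(control) and max(test) never use i, so they look at the whole table on every iteration, and list is used before it is ever created (the name also collides with base R's list()). The names keep and kept_rows are illustrative, not from the original post.

```r
# Sample data from the question
gene1 <- c(100, 200, 300, 400, 500, 600)
gene2 <- c(600, 500, 400, 300, 200, 100)
gene3 <- c(100, 200, 400, 300, 500, 600)
data <- as.data.frame(rbind(gene1, gene2, gene3))
colnames(data) <- c("control1", "control2", "control3", "test1", "test2", "test3")

control <- data[, 1:3]
test <- data[, 4:6]

# One TRUE/FALSE flag per row, preallocated (hypothetical helper name)
keep <- logical(nrow(data))
for (i in seq_len(nrow(data))) {
  # Index row i explicitly; without [i, ] min() and max() see every row
  keep[i] <- min(control[i, ]) > max(test[i, ])
}

# Subset once at the end instead of growing an object inside the loop
kept_rows <- data[keep, ]
```

Collecting a logical vector and subsetting once avoids calling rbind() repeatedly, which tends to get slow as the number of rows grows.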
Solution 2:[2]
Here are 2 options, one with base R, one that's more verbose reshaping with tidyr and dplyr. Both allow you to work without hardcoding anything, and instead use regex to separate columns.
For the first, of course you can do the grep bit inside the apply call; I separated it out to be more clear.
library(dplyr)
control_cols <- grep("control", names(data))
test_cols <- grep("test", names(data))
data[apply(data[control_cols], 1, min) > apply(data[test_cols], 1, max), ]
#> control1 control2 control3 test1 test2 test3
#> gene2 600 500 400 300 200 100
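With more than 10000 rows, it may be worth avoiding the per-row function calls that apply() makes. The following is a sketch of one possible alternative, not part of the original answer: base R's pmin() and pmax() compute element-wise minima and maxima across several vectors at once, and do.call() passes each selected column as a separate argument. The names row_min_control, row_max_test and result are illustrative.

```r
# Sample data from the question
gene1 <- c(100, 200, 300, 400, 500, 600)
gene2 <- c(600, 500, 400, 300, 200, 100)
gene3 <- c(100, 200, 400, 300, 500, 600)
data <- as.data.frame(rbind(gene1, gene2, gene3))
colnames(data) <- c("control1", "control2", "control3", "test1", "test2", "test3")

# Select columns by name pattern, as in the answer above
control_cols <- grep("control", names(data))
test_cols <- grep("test", names(data))

# A data frame is a list of columns, so do.call() hands each column
# to pmin()/pmax() as its own argument; the result has one value per row
row_min_control <- do.call(pmin, data[control_cols])
row_max_test <- do.call(pmax, data[test_cols])

# Keep rows where the smallest control value exceeds the largest test value
result <- data[row_min_control > row_max_test, ]
```

Whether this is actually faster on a given data set is an assumption worth checking with system.time() on the real table.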
A more verbose but possibly more flexible option (for example, if you had more groups than just control and test, or a different set of comparisons) is to reshape the data so that there is one column of control values and one column of test values, filter by gene, and then reshape back.
data %>%
  tibble::rownames_to_column("gene") %>%
  tidyr::pivot_longer(-gene, names_to = c(".value", "num"),
                      names_pattern = "(^[a-z]+)(\\d+$)") %>%
  group_by(gene) %>%
  filter(min(control) > max(test)) %>%
  tidyr::pivot_wider(names_from = num, values_from = c(control, test),
                     names_sep = "")
#> # A tibble: 1 × 7
#> # Groups: gene [1]
#> gene control1 control2 control3 test1 test2 test3
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 gene2 600 500 400 300 200 100
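To see what the reshaping step does, it can help to stop the pipeline after pivot_longer() and inspect the intermediate table. The sketch below assumes the tidyr and tibble packages are installed and uses nested calls instead of the pipe; with_gene and long are illustrative names.

```r
# Sample data from the question
gene1 <- c(100, 200, 300, 400, 500, 600)
gene2 <- c(600, 500, 400, 300, 200, 100)
gene3 <- c(100, 200, 400, 300, 500, 600)
data <- as.data.frame(rbind(gene1, gene2, gene3))
colnames(data) <- c("control1", "control2", "control3", "test1", "test2", "test3")

# Move the row names into a regular column so they survive the reshape
with_gene <- tibble::rownames_to_column(data, "gene")

# The regex splits each column name into a letter part and a digit part.
# ".value" turns the letter part (control/test) into output columns,
# and the digit part goes into a new "num" column.
long <- tidyr::pivot_longer(
  with_gene, -gene,
  names_to = c(".value", "num"),
  names_pattern = "(^[a-z]+)(\\d+$)"
)

print(long)
```

Each gene now occupies three rows (one per replicate) with a control and a test column, which is why min(control) > max(test) can be evaluated per group after group_by(gene).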
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | benson23 |
| Solution 2 | |
