R: keep the row if all values in group A are larger than all values in group B

I’m new to loops in R.
I have a count table of mRNA transcripts for control and test samples, each in triplicate.

gene1 <- c(100, 200, 300, 400, 500, 600)
gene2 <- c(600, 500, 400, 300, 200, 100)
gene3 <- c(100, 200, 400, 300, 500, 600)
data <- rbind(gene1, gene2, gene3)
colnames(data) <- c("control1", "control2","control3","test1","test2","test3")
data <- as.data.frame(data)

I would like to go through it and keep a gene (row) only if all 3 “control” samples have larger values than all 3 “test” samples.
The rows that pass should then be collected into a new list.
(There are more than 10000 rows in the real data set.)

I've tried the code below, and also the all() function instead of min()/max(), but it doesn't work.

control <- data[, c(1,2,3)]
test <- data[, c(4,5,6)]

for (i in 1:nrow(data)){
   if(min(control)>max(test)){
    list <- rbind(i, list)
}}

Thank you!


Solution 1:[1]

Here is a base R approach. For each row, it compares the smallest of the three control values with the largest of the three test values and keeps the row only if the former is larger.

data[sapply(1:nrow(data), function(x)
  data[x, which.min(data[x, 1:3])] > data[x, which.max(data[x, 4:6]) + 3])
  , ]

Output

     control1 control2 control3 test1 test2 test3
gene2      600      500      400   300   200   100
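
For reference, the loop from the question can also be made to work. The sketch below is one possible fix, not the only one: it assumes the same column layout as in the question, indexes row i inside the loop (the original called min() and max() on the whole tables), and stores the result in a logical vector instead of growing an object named list, which was never initialised. The names keep and kept_rows are made up for this example.

```r
# Example data from the question
gene1 <- c(100, 200, 300, 400, 500, 600)
gene2 <- c(600, 500, 400, 300, 200, 100)
gene3 <- c(100, 200, 400, 300, 500, 600)
data <- rbind(gene1, gene2, gene3)
colnames(data) <- c("control1", "control2", "control3", "test1", "test2", "test3")
data <- as.data.frame(data)

control <- data[, 1:3]
test <- data[, 4:6]

# One TRUE/FALSE per row, allocated up front (hypothetical name)
keep <- logical(nrow(data))

for (i in seq_len(nrow(data))) {
  # Compare row i only: smallest control value vs. largest test value
  keep[i] <- min(control[i, ]) > max(test[i, ])
}

# Subset once at the end instead of calling rbind() inside the loop
kept_rows <- data[keep, ]
kept_rows
```

With more than 10000 rows this loop should still finish, but a vectorised comparison is usually the more idiomatic choice in R.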

Solution 2:[2]

Here are 2 options: one in base R, and a more verbose one that reshapes the data with tidyr and dplyr. Neither hardcodes column positions; both use regex on the column names to tell the control columns from the test columns.

For the first, you can of course do the grep calls inside the apply calls; I separated them out for clarity.

library(dplyr)  # only needed for the tidyr/dplyr pipeline in the second option

control_cols <- grep("control", names(data))
test_cols <- grep("test", names(data))

data[apply(data[control_cols], 1, min) > apply(data[test_cols], 1, max), ]
#>       control1 control2 control3 test1 test2 test3
#> gene2      600      500      400   300   200   100
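
Since the real table has more than 10000 rows, it may be worth avoiding the per-row function calls that apply makes. A minimal sketch of one alternative, assuming the same column naming as above: pmin() and pmax() compute element-wise minima and maxima across several vectors, so passing the columns via do.call() gives the row-wise values in one step. The names row_min_control, row_max_test and fast_result are made up for this example, and I have not benchmarked it on real data.

```r
# Example data from the question
gene1 <- c(100, 200, 300, 400, 500, 600)
gene2 <- c(600, 500, 400, 300, 200, 100)
gene3 <- c(100, 200, 400, 300, 500, 600)
data <- rbind(gene1, gene2, gene3)
colnames(data) <- c("control1", "control2", "control3", "test1", "test2", "test3")
data <- as.data.frame(data)

control_cols <- grep("control", names(data))
test_cols <- grep("test", names(data))

# Each column becomes one argument to pmin()/pmax(); unname() drops the
# column names so they are not passed along as argument names
row_min_control <- do.call(pmin, unname(as.list(data[control_cols])))
row_max_test <- do.call(pmax, unname(as.list(data[test_cols])))

# Keep rows where every control value exceeds every test value
fast_result <- data[row_min_control > row_max_test, ]
fast_result
```

The comparison is strict, so a row where the smallest control value equals the largest test value is dropped, the same as in the apply version.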

The second option is more verbose, but possibly more flexible (e.g. if you had more types than just control & test, or some other set of comparisons): reshape so that there is one column of all control values and one column of all test values, compare within each gene, and reshape back.

data %>%
  tibble::rownames_to_column("gene") %>%
  tidyr::pivot_longer(-gene, names_to = c(".value", "num"), 
                      names_pattern = "(^[a-z]+)(\\d+$)") %>%
  group_by(gene) %>%
  filter(min(control) > max(test)) %>%
  tidyr::pivot_wider(names_from = num, values_from = c(control, test), 
                     names_sep = "")
#> # A tibble: 1 × 7
#> # Groups:   gene [1]
#>   gene  control1 control2 control3 test1 test2 test3
#>   <chr>    <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl>
#> 1 gene2      600      500      400   300   200   100
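
To see how the names_pattern regex splits the column names, here is a small base R sketch. It uses sub() with the same pattern to pull out each capture group; this only illustrates the regex and is not how tidyr works internally. The variable names pattern, cols, group and num are made up for this example.

```r
# Same pattern as in names_pattern above: letters first, then trailing digits
pattern <- "(^[a-z]+)(\\d+$)"
cols <- c("control1", "control2", "control3", "test1", "test2", "test3")

# First capture group: the sample type, which becomes the new column name
group <- sub(pattern, "\\1", cols)

# Second capture group: the replicate number, which goes into "num"
num <- sub(pattern, "\\2", cols)

group
#> [1] "control" "control" "control" "test"    "test"    "test"
num
#> [1] "1" "2" "3" "1" "2" "3"
```

A column name that does not follow the letters-then-digits scheme would not match this pattern, so the regex would need adjusting for differently named samples.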

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1    benson23
Solution 2