'How can I find common rows between two dataframes based on two different numeric columns?
I have two different data frames that I want to find common rows between them according to two different numeric columns (these columns are the location of the genes in genome, start and end), but I don't want to be completely match based on those two columns and I want to find commons with (-+) ranges.
Actually I want to find common rows even they are with for example 200 bp or 500 bp differences in start and end locations. I am beginner in R and I could not find a way to do this.
In df2 although DEC1 has different START location but its difference is 400 bp with DEC1 in df1, so i want to consider this gene as a common between df1 and 2 but PSA and AKT have difference in START more than 500bp and I do not need them, although they have same END position between df1 and 2.
df1 <- data.frame(name = c("DEC1", "PSA", "DEC2", "AKT"), START = c("9494957", "39689186", "89435677", "78484829"), END = c("52521320", "114050940", "100952138", "78486308"), STRAND = c("+", "+", "+", "-"))
df2 <- data.frame(name = c("DEC1", "PSA", "DEC2", "AKT"), START = c("9494557", "37689186", "89435677", "79484829"), END = c("52521320", "114050940", "100952138", "78486308"), STRAND = c("+", "+", "+", "-"))
Solution 1:[1]
To my knowledge, merge or join would be the only way to compare two columns.
Using data.table,
require(data.table)
#> Loading required package: data.table
df1 <- setDT(data.frame(name = c("DEC1", "PSA", "DEC2", "AKT"), START = c(9494957, 39689186, 89435677, 78484829), END = c(52521320, 114050940, 100952138, 78486308), STRAND = c("+", "+", "+", "-")))
df2 <- setDT(data.frame(name = c("DEC1", "PSA", "DEC2", "AKT"), START = c(9494557, 37689186, 89435677, 79484829), END = c(52521320, 114050940, 100952138, 78486308), STRAND = c("+", "+", "+", "-")))
df3 <- df1[df2[,.(name,START2=START, END2 = END)], on='name']
df3[abs(START2-START) %between% c(0,500) |
abs(START2-START) %between% c(0,500)]
#> name START END STRAND START2 END2
#> 1: DEC1 9494957 52521320 + 9494557 52521320
#> 2: DEC2 89435677 100952138 + 89435677 100952138
Or using dplyr,
df3 <- inner_join(df1, df2, suffix=c('1','2'),by='name')
df3 %>% filter(abs(START2-START1)<500)
Created on 2022-04-30 by the reprex package (v2.0.1)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
