'Merging data frames by selecting for correct value

I have a data frame called "ref" that contains information that allows mapping of gene entrez ID to the gene's start and end positions. I have another data frame "ori_data" where each row contains unique mutations from samples, which gives a genomic position. I am trying to assign each position given in "ori_data" to map to information on "ref" in order to assign entrez ID to each mutation. I have tried a for loop to match for the same chromosome, and then select for positions in "ori_data" that fall between the coordinates in "ref" though I have not been successful. The "ori_data" dataset is over 1 million rows, so I'm not sure a for loop is an efficient solution. Note that many positions will be mapped to the same entrez ID in my real dataset. "Final" is what I want to happen- which would just add a column for entrezID according to chromosome/position. TYIA!

ref = data.frame("EntrezID" = c(1, 10, 100, 1000), "Chromosome" = c("19", "8", "20", "18"), "txStarts" = c("58345182", "18391281", "44619518", "27950965"), "txEnds" = c("58353492", "18401215", "44651758", "28177130"))

ori_data = data.frame("Chromosome" = c("19", "8", "20", "18"), "Pos" = c("58345186", "18401213", "44619519", "27950966"),
             "Sample" = c("HCC1", "HCC2", "HCC1", "HCC3"))

final = data.frame("Chromosome" = c("19", "8", "20", "18"), "Pos" = c("58345186", "18401213", "44619519", "27950966"),
               "Sample" = c("HCC1", "HCC2", "HCC1", "HCC3"), "EntrezID" = c(1,10,100,1000))

I have tried this line of code and I'm unsure as to why it does not work.

for (i in 1:dim(ori_data)[1])
{
  for (j in 1:dim(ref)[1])
  {
    ID = which(ori_data[i, "Chromosome"] == ref[j, 
     "Chromosome"])
    if (length(ID) > 0)
    {
      Pos = ori_data[ID, "POS"]
      IDj = which(Pos >= ref[j, "txStarts"] & Pos <= 
           ref[j, "txEnds"])
      print(IDj)
      if (length(IDj) > 0)
       {
        ori_data = cbind("Entrez" = ref[IDj, 
                  "EntrezID"], ori_data)
     }
   }
 }
}
r


Solution 1:[1]

In base apply could be used to find matches per row for Chromosome and test if Pos is in the range of txStarts txEnds.

ori_data$EntrezID <- apply(ori_data[c("Chromosome", "Pos")], 1, \(x)
  ref$EntrezID[ref$Chromosome == x["Chromosome"] &
    x["Pos"] >= ref$txStarts & x["Pos"] <= ref$txEnds][1])
ori_data
#  Chromosome      Pos Sample EntrezID
#1         19 58345186   HCC1        1
#2          8 18401213   HCC2       10
#3         20 44619519   HCC1      100
#4         18 27950966   HCC3     1000

A version which could be faster:

lup <- list2env(split(ref[c("EntrezID", "txStarts", "txEnds")], ref$Chromosome))
ori_data$EntrezID <- Map(\(x, y) {
  . <- get(x, envir=lup)
  .$EntrezID[y >= .$txStarts & y <= .$txEnds][1]
}, ori_data$Chromosome, ori_data$Pos)

Or another way but not keeping the original order. (If original order is important, have a look at unsplit.)

#Assuming you have many rows with same Chromosome
x <- split(ori_data, ori_data$Chromosome)

#Assuming you have also here many rows with same Chromosome
lup <- split(ref[c("EntrezID", "txStarts", "txEnds")], ref$Chromosome)

#Now I am soting this by the names of x - try which Method ist faster
#Method 1:
lup <- lup[names(x)]
#Method 2:
lup <- mget(names(x), list2env(lup))

res <- do.call(rbind, Map(\(a, b) {
  cbind(a, b[1][a$Pos >= b[[2]] &  a$Pos <= b[[3]]][1])
}, x, lup))

Solution 2:[2]

One option would be to use sqldf, which should also be efficient for a large dataframe.

library(tibble)
library(sqldf)

as_tibble(sqldf("select dna.*, ref.EntrezID from dna
                join ref on dna.Pos > ref.'txStarts' and 
                dna.Pos < ref.'txEnds'"))

Another option using fuzzy_join:

library(dplyr)
library(fuzzyjoin)

dna %>%
  fuzzy_join(ref %>% select(-Chromosome), by = c("Pos" = "txStarts", "Pos" = "txEnds"),
             match_fun = list(`>`, `<`)) %>%
  select(names(dna), EntrezID)

Output

  Chromosome      Pos Sample EntrezID
1         19 58345186   HCC1        1
2          8 18401213   HCC2       10
3         20 44619519   HCC1      100
4         18 27950966   HCC3     1000

Solution 3:[3]

If the 'Pos', 'txStarts', 'txEnds' are numeric, then we can use non-equi join

library(data.table)
setDT(dna)[ref, EntrezID := i.EntrezID, 
    on = .(Chromosome, Pos > txStarts, Pos <txEnds)]

-output

> dna
   Chromosome      Pos Sample EntrezID
       <char>    <num> <char>    <num>
1:         19 58345186   HCC1        1
2:          8 18401213   HCC2       10
3:         20 44619519   HCC1      100
4:         18 27950966   HCC3     1000

data

dna <- type.convert(dna, as.is = TRUE)
ref <- type.convert(ref, as.is = TRUE)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3