Extract strings based on a database in R dplyr

For each sequence in my data, I want to extract the substring that lies between an L (left) string and an R (right) string taken from my database.

My database includes 4 different L and R string combinations (3 are shown in the example below), and I want to test all of them.

One way is to write a for loop, but is there a more elegant way? The code below only tests the first row of the database.

library(tidyverse)

data <-  c("CCACGAAGCTCTCCTACGTACGGTTATATTGACAGACCGAGGGCAGTCCAGCGCCAACCAGATAAGTGAAATCTAGTTCCA",
          "CCACGAAGCTCTCCTACGTACGGTTATATTGACAGACCGAGGGCAGTCCAGCGCCAACCAGATAAGTGAAATCTAGTTCCA",
          "CCACGAAGCTCTCCTAGGGGGGGGCTATTTTGGACTGCGTTACCAGTCCAGCGCCAACCAGATAAGTGGAATCTAGTTCGA",
          "CCACGTAGCTCTCCTCCGTGCGGTTATATTGACAGACCGAGGGCAGTCCAGCGCCAACCAGATAAGTGAAATCTAGTTCCA") %>% 
  as.data.frame() %>% 
  rename(seq=1)

database <- data.frame(L = c("CTACG", "CTAGG", "CTCCG"), R = c("CAGTC", "CAGTC", "CAGTC"))


# Only the first L/R pair (database[1, ]) is tested here.
# The look-behind (?<=L) and look-ahead (?=R) keep L and R out of the match.
data %>% 
  mutate(extracts = str_extract(seq,
    str_c("(?<=", database[1, 1], ").*(?=", database[1, 2], ")")))
#>                                                                                 seq
#> 1 CCACGAAGCTCTCCTACGTACGGTTATATTGACAGACCGAGGGCAGTCCAGCGCCAACCAGATAAGTGAAATCTAGTTCCA
#> 2 CCACGAAGCTCTCCTACGTACGGTTATATTGACAGACCGAGGGCAGTCCAGCGCCAACCAGATAAGTGAAATCTAGTTCCA
#> 3 CCACGAAGCTCTCCTAGGGGGGGGCTATTTTGGACTGCGTTACCAGTCCAGCGCCAACCAGATAAGTGGAATCTAGTTCGA
#> 4 CCACGTAGCTCTCCTCCGTGCGGTTATATTGACAGACCGAGGGCAGTCCAGCGCCAACCAGATAAGTGAAATCTAGTTCCA
#>                    extracts
#> 1 TACGGTTATATTGACAGACCGAGGG
#> 2 TACGGTTATATTGACAGACCGAGGG
#> 3                      <NA>
#> 4                      <NA>

Created on 2022-02-01 by the reprex package (v2.0.1)
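
One possible loop-free approach, sketched here under some assumptions rather than offered as the accepted solution: because str_c() is vectorised, a single call can build one pattern per database row, and dplyr's coalesce() can then keep the first non-missing match for each sequence. The names patterns, matches and result are illustrative, and the sketch assumes that when several L/R pairs match the same sequence, the pair that comes first in the database should win. It also assumes L and R contain only plain letters, so no regex escaping is needed.

```r
library(dplyr)    # coalesce(), mutate()
library(stringr)  # str_c(), str_extract()

data <- data.frame(seq = c(
  "CCACGAAGCTCTCCTACGTACGGTTATATTGACAGACCGAGGGCAGTCCAGCGCCAACCAGATAAGTGAAATCTAGTTCCA",
  "CCACGAAGCTCTCCTACGTACGGTTATATTGACAGACCGAGGGCAGTCCAGCGCCAACCAGATAAGTGAAATCTAGTTCCA",
  "CCACGAAGCTCTCCTAGGGGGGGGCTATTTTGGACTGCGTTACCAGTCCAGCGCCAACCAGATAAGTGGAATCTAGTTCGA",
  "CCACGTAGCTCTCCTCCGTGCGGTTATATTGACAGACCGAGGGCAGTCCAGCGCCAACCAGATAAGTGAAATCTAGTTCCA"
))

database <- data.frame(L = c("CTACG", "CTAGG", "CTCCG"),
                       R = c("CAGTC", "CAGTC", "CAGTC"))

# str_c() is vectorised over L and R, so this yields one pattern per database row.
patterns <- str_c("(?<=", database$L, ").*(?=", database$R, ")")

# Apply every pattern to every sequence: one character vector per pattern.
matches <- lapply(patterns, function(p) str_extract(data$seq, p))

# For each sequence, keep the first non-NA match in database row order.
result <- data %>%
  mutate(extracts = do.call(coalesce, matches))

result$extracts
```

With the example data, rows 1 and 2 should give the same extract as the single-pattern version above, while rows 3 and 4 are now matched by the second and third database rows instead of returning NA. If it matters which pair produced a match, the matches list could be kept alongside the result rather than collapsed with coalesce().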



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
