'Extract strings based on a database in R dplyr
From my data I want to extract the strings that are between the L and R string from my database.
My database includes 4 different L and R string combinations and I want to test all of them.
One way is to write a for loop, but is there any more elegant and clever way?
library(tidyverse)
data <- c("CCACGAAGCTCTCCTACGTACGGTTATATTGACAGACCGAGGGCAGTCCAGCGCCAACCAGATAAGTGAAATCTAGTTCCA",
"CCACGAAGCTCTCCTACGTACGGTTATATTGACAGACCGAGGGCAGTCCAGCGCCAACCAGATAAGTGAAATCTAGTTCCA",
"CCACGAAGCTCTCCTAGGGGGGGGCTATTTTGGACTGCGTTACCAGTCCAGCGCCAACCAGATAAGTGGAATCTAGTTCGA",
"CCACGTAGCTCTCCTCCGTGCGGTTATATTGACAGACCGAGGGCAGTCCAGCGCCAACCAGATAAGTGAAATCTAGTTCCA") %>%
as.data.frame() %>%
rename(seq=1)
database=data.frame(L=c("CTACG","CTAGG","CTCCG"), R=c("CAGTC","CAGTC","CAGTC"))
data %>%
mutate(extracts= str_extract(.$seq,
str_c("(?<=",str_c(database[1,1], collapse = ""),").*(?=",str_c(database[1,2], collapse = ""),")")))
#> seq
#> 1 CCACGAAGCTCTCCTACGTACGGTTATATTGACAGACCGAGGGCAGTCCAGCGCCAACCAGATAAGTGAAATCTAGTTCCA
#> 2 CCACGAAGCTCTCCTACGTACGGTTATATTGACAGACCGAGGGCAGTCCAGCGCCAACCAGATAAGTGAAATCTAGTTCCA
#> 3 CCACGAAGCTCTCCTAGGGGGGGGCTATTTTGGACTGCGTTACCAGTCCAGCGCCAACCAGATAAGTGGAATCTAGTTCGA
#> 4 CCACGTAGCTCTCCTCCGTGCGGTTATATTGACAGACCGAGGGCAGTCCAGCGCCAACCAGATAAGTGAAATCTAGTTCCA
#> extracts
#> 1 TACGGTTATATTGACAGACCGAGGG
#> 2 TACGGTTATATTGACAGACCGAGGG
#> 3 <NA>
#> 4 <NA>
Created on 2022-02-01 by the reprex package (v2.0.1)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
