Use a character vector to build a new dataframe containing rows in which strings from that vector are found
I want to query a database using a character vector, query_list, and return a dataframe, query_output. In this two-column output dataframe, each row corresponds to a single string from the query vector. The first column (called term) holds that string, and the second column (called enzyme) lists every row of the database in which the query string was found, identified by the database's enzyme column.
My query and database look as follows:
query_list <- c("term(A)", "term(B)", "term(C)", "term(D15)")
database <- data.frame(enzyme = c("A1", "B1", "C1", "D1", "E1")
,term = c("term(A);term(K);term(Y);term(G);term(F);"
,"term(A);term(K);term(Y);term(G);term(F);"
,"term(H);term(K);term(Y);term(C);term(F);"
,"term(H);term(B);term(Y);term(C);term(F);"
,"term(H);term(K);term(D15);term(G);term(F);"))
The database looks like this:
enzyme term
1 A1 term(A);term(K);term(Y);term(G);term(F);
2 B1 term(A);term(K);term(Y);term(G);term(F);
3 C1 term(H);term(K);term(Y);term(C);term(F);
4 D1 term(H);term(B);term(Y);term(C);term(F);
5 E1 term(H);term(K);term(D15);term(G);term(F);
The desired output dataframe, query_output:
> query_output
term enzyme
1 term(A) A1, B1
2 term(B) D1
3 term(C) C1, D1
4 term(D15) E1
Ideally, the solution would be pipeable and not a loop (although anything will be appreciated). I haven't included an attempt because I don't really know how to approach this in a concise way.
Solution 1:[1]
Using separate_rows() from the tidyr package, you can split the values in term into one row each. Then filter by your query_list, group by term, and use paste0(..., collapse = ', ') to collapse all enzymes for each term into a single row.
library(dplyr)
database %>%
  tidyr::separate_rows(term, sep = ";") %>%
  filter(term %in% query_list) %>%
  group_by(term) %>%
  summarise(enzyme = paste0(enzyme, collapse = ', '))
Output:
# A tibble: 4 x 2
term enzyme
<chr> <chr>
1 term(A) A1, B1
2 term(B) D1
3 term(C) C1, D1
4 term(D15) E1
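One caveat with filtering on term %in% query_list: a query string that occurs nowhere in the database simply disappears from the result, whereas the question asks for one row per query string. Below is a minimal sketch of one way to keep such terms, assuming an NA in enzyme is acceptable for them. The query term "term(Z)" is made up for illustration and is not part of the original data.

```r
library(dplyr)
library(tidyr)

database <- data.frame(
  enzyme = c("A1", "B1", "C1", "D1", "E1"),
  term = c("term(A);term(K);term(Y);term(G);term(F);",
           "term(A);term(K);term(Y);term(G);term(F);",
           "term(H);term(K);term(Y);term(C);term(F);",
           "term(H);term(B);term(Y);term(C);term(F);",
           "term(H);term(K);term(D15);term(G);term(F);")
)

# "term(Z)" is a hypothetical query term that matches no database row
query_list <- c("term(A)", "term(B)", "term(C)", "term(D15)", "term(Z)")

# Same pipeline as in the answer: one row per matched term
hits <- database %>%
  separate_rows(term, sep = ";") %>%
  filter(term %in% query_list) %>%
  group_by(term) %>%
  summarise(enzyme = paste0(enzyme, collapse = ", "))

# Start from the query vector so every query string keeps its row;
# terms without a match get NA in the enzyme column
query_output <- data.frame(term = query_list) %>%
  left_join(hits, by = "term")

query_output
```

Starting the join from the query vector also keeps the rows in the order of query_list rather than in the sorted order produced by group_by().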
Solution 2:[2]
Try this:
library(dplyr)
library(tidyr)
database %>%
mutate(term = sub(";$", "", term)) %>%
separate_rows(term, sep = ";") %>%
filter(term %in% query_list) %>%
group_by(term) %>%
summarise(enzyme = toString(enzyme))
# A tibble: 4 × 2
term enzyme
<chr> <chr>
1 term(A) A1, B1
2 term(B) D1
3 term(C) C1, D1
4 term(D15) E1
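Two details of this answer can be checked in isolation with base R: toString() on a character vector gives the same string as paste(..., collapse = ", "), and sub(";$", "", ...) removes only the final semicolon. The values of enzymes and terms below are illustrative and taken from the first row of the example database.

```r
# toString() joins the elements with ", ", just like paste(..., collapse = ", ")
enzymes <- c("A1", "B1")
same <- identical(toString(enzymes), paste(enzymes, collapse = ", "))

# ";$" is anchored at the end of the string, so only the trailing ";" is dropped
terms <- "term(A);term(K);term(Y);term(G);term(F);"
trimmed <- sub(";$", "", terms)

same     # TRUE
trimmed  # "term(A);term(K);term(Y);term(G);term(F)"
```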
Solution 3:[3]
You can transform your database into your desired format:
library(dplyr)
library(tidyr)
transformed_database <- database %>%
separate_rows(term, sep = ';') %>%
filter(term != '') %>%
group_by(term) %>%
summarise(enzyme = paste0(enzyme, collapse = ', '))
transformed_database
#> # A tibble: 9 × 2
#> term enzyme
#> <chr> <chr>
#> 1 term(A) A1, B1
#> 2 term(B) D1
#> 3 term(C) C1, D1
#> 4 term(D15) E1
#> 5 term(F) A1, B1, C1, D1, E1
#> 6 term(G) A1, B1, E1
#> 7 term(H) C1, D1, E1
#> 8 term(K) A1, B1, C1, E1
#> 9 term(Y) A1, B1, C1, D1
Then, querying it is as simple as
transformed_database %>%
filter(term %in% query_list)
#> # A tibble: 4 × 2
#> term enzyme
#> <chr> <chr>
#> 1 term(A) A1, B1
#> 2 term(B) D1
#> 3 term(C) C1, D1
#> 4 term(D15) E1
Solution 4:[4]
We can iterate through the query_list in base R, and use enframe from tibble to turn the result into a dataframe.
library(tibble)
# fixed = TRUE makes grepl() treat the parentheses in each query string literally
enframe(sapply(query_list, function(x)
  paste(database$enzyme[grepl(x, database$term, fixed = TRUE)], collapse = ", ")),
  name = "term",
  value = "enzyme")
# A tibble: 4 × 2
term enzyme
<chr> <chr>
1 term(A) A1, B1
2 term(B) D1
3 term(C) C1, D1
4 term(D15) E1
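The fixed argument matters here because the query strings contain parentheses, which are grouping characters in a regular expression. Even with fixed = TRUE, grepl() still matches substrings rather than whole terms. The sketch below illustrates both points and shows a base R variant that compares whole terms instead. It is one possible approach, not the answer's own method, and the names term_sets and query_output are chosen here for illustration.

```r
database <- data.frame(
  enzyme = c("A1", "B1", "C1", "D1", "E1"),
  term = c("term(A);term(K);term(Y);term(G);term(F);",
           "term(A);term(K);term(Y);term(G);term(F);",
           "term(H);term(K);term(Y);term(C);term(F);",
           "term(H);term(B);term(Y);term(C);term(F);",
           "term(H);term(K);term(D15);term(G);term(F);")
)
query_list <- c("term(A)", "term(B)", "term(C)", "term(D15)")

# As a regex, "term(A)" means "term" followed by the group "A", i.e. "termA"
regex_hit <- grepl("term(A)", database$term[1])                # FALSE
fixed_hit <- grepl("term(A)", database$term[1], fixed = TRUE)  # TRUE

# Split each row into its individual terms (a list of character vectors)
term_sets <- strsplit(database$term, ";", fixed = TRUE)

# For every query string, find the rows whose term set contains it exactly
query_output <- data.frame(
  term = query_list,
  enzyme = vapply(query_list, function(q) {
    hit <- vapply(term_sets, function(terms) q %in% terms, logical(1))
    paste(database$enzyme[hit], collapse = ", ")
  }, character(1), USE.NAMES = FALSE)
)

query_output
```

Because %in% compares complete strings, a query string that is only part of a longer term does not count as a match in this variant.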
Solution 5:[5]
My solution is:
library(dplyr)
library(tidyr)
library(stringr)
database %>%
  separate_rows(term, sep = ";") %>%
  filter(term != "") %>%
  filter(term %in% query_list) %>%
  group_by(term) %>%
  summarise(enzyme = str_c(enzyme, collapse = ", ")) %>%
  ungroup()
Which results in
# A tibble: 4 × 2
term enzyme
<chr> <chr>
1 term(A) A1, B1
2 term(B) D1
3 term(C) C1, D1
4 term(D15) E1
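Since the question asks for something pipeable, the steps shared by the tidyverse answers can be wrapped in a small helper. This is only a sketch: the function name query_enzymes and its arguments db and terms are hypothetical, and it assumes the database has columns named term and enzyme as in the example.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: returns one row per matched query term
query_enzymes <- function(db, terms) {
  db %>%
    separate_rows(term, sep = ";") %>%      # one row per individual term
    filter(term %in% terms) %>%             # keep only the queried terms
    group_by(term) %>%
    summarise(enzyme = toString(enzyme))    # "A1, B1" style listing
}

database <- data.frame(
  enzyme = c("A1", "B1", "C1", "D1", "E1"),
  term = c("term(A);term(K);term(Y);term(G);term(F);",
           "term(A);term(K);term(Y);term(G);term(F);",
           "term(H);term(K);term(Y);term(C);term(F);",
           "term(H);term(B);term(Y);term(C);term(F);",
           "term(H);term(K);term(D15);term(G);term(F);")
)
query_list <- c("term(A)", "term(B)", "term(C)", "term(D15)")

# The helper slots into a pipe like any other verb
query_output <- database %>%
  query_enzymes(query_list)

query_output
```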
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | RobertoT |
| Solution 2 | Chris Ruehlemann |
| Solution 3 | Aron |
| Solution 4 | benson23 |
| Solution 5 | Mossa |
