Use a character vector to build a new dataframe containing rows in which strings from that vector are found

I want to query a database using a character vector query_list and return a dataframe query_output. In this two-column output dataframe, each row corresponds to a single string from the query vector. The first column (called term) holds that string, and the second column (called enzyme) lists every row of the database in which the query string was found, identified by the value of the database's enzyme column.

My query and database look as follows:

query_list <- c("term(A)", "term(B)", "term(C)", "term(D15)")
database <- data.frame(enzyme = c("A1", "B1", "C1", "D1", "E1")
                       ,term = c("term(A);term(K);term(Y);term(G);term(F);"
                                 ,"term(A);term(K);term(Y);term(G);term(F);"
                                 ,"term(H);term(K);term(Y);term(C);term(F);"
                                 ,"term(H);term(B);term(Y);term(C);term(F);"
                                 ,"term(H);term(K);term(D15);term(G);term(F);"))

The database looks like this:

  enzyme                                       term
1     A1   term(A);term(K);term(Y);term(G);term(F);
2     B1   term(A);term(K);term(Y);term(G);term(F);
3     C1   term(H);term(K);term(Y);term(C);term(F);
4     D1   term(H);term(B);term(Y);term(C);term(F);
5     E1   term(H);term(K);term(D15);term(G);term(F);

The resulting dataframe query_output:

> query_output 
       term  enzyme
1   term(A)  A1, B1
2   term(B)      D1
3   term(C)  C1, D1
4 term(D15)      E1

Ideally, the solution would be pipeable and not a loop (although anything will be appreciated). I haven't included an attempt because I don't really know how to approach this in a concise way.



Solution 1:[1]

Using separate_rows() from the tidyr package, you can split the values in term so that each one gets its own row. Then filter by your query_list, group by term, and use paste0(..., collapse = ', ') to collapse the enzymes for each term into a single row.

library(dplyr)

database %>% 
  tidyr::separate_rows(term, sep = ";") %>% 
  filter(term %in% query_list) %>%
  group_by(term) %>% 
  summarise(enzyme = paste0(enzyme, collapse = ', '))

Output:

# A tibble: 4 x 2
  term      enzyme
  <chr>     <chr> 
1 term(A)   A1, B1 
2 term(B)   D1    
3 term(C)   C1, D1 
4 term(D15) E1   

Solution 2:[2]

Try this:

library(dplyr)
library(tidyr)
database %>%
  # remove the trailing ";" from each term string before splitting
  mutate(term = sub(";$", "", term)) %>%
  separate_rows(term, sep = ";") %>%
  filter(term %in% query_list) %>%
  group_by(term) %>%
  summarise(enzyme = toString(enzyme))
# A tibble: 4 × 2
  term      enzyme
  <chr>     <chr> 
1 term(A)   A1, B1
2 term(B)   D1    
3 term(C)   C1, D1
4 term(D15) E1  
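
Solutions 1 and 2 differ mainly in how they collapse the enzymes of one group into a single string. As a small sketch (the enzymes vector below is a hypothetical stand-in for one group), base R's toString() gives the same result as paste0() with collapse = ", ":

```r
# Hypothetical stand-in for the enzymes that fall into one term group
enzymes <- c("A1", "B1")

via_toString <- toString(enzymes)
via_paste0   <- paste0(enzymes, collapse = ", ")

# Both calls build the same single comma-separated string
identical(via_toString, via_paste0)
```

Either call can therefore be used inside summarise(); toString() is just the shorter spelling when the separator is a comma followed by a space.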

Solution 3:[3]

You can transform your database into the desired format once, up front:

library(dplyr)
library(tidyr)

transformed_database <- database %>%
  separate_rows(term, sep = ';') %>%
  filter(term != '') %>%
  group_by(term) %>%
  summarise(enzyme = paste0(enzyme, collapse = ', '))

transformed_database
#> # A tibble: 9 × 2
#>   term      enzyme            
#>   <chr>     <chr>             
#> 1 term(A)   A1, B1            
#> 2 term(B)   D1                
#> 3 term(C)   C1, D1            
#> 4 term(D15) E1                
#> 5 term(F)   A1, B1, C1, D1, E1
#> 6 term(G)   A1, B1, E1        
#> 7 term(H)   C1, D1, E1        
#> 8 term(K)   A1, B1, C1, E1    
#> 9 term(Y)   A1, B1, C1, D1

Then, querying it is as simple as:

transformed_database %>%
  filter(term %in% query_list)

#> # A tibble: 4 × 2
#>   term      enzyme
#>   <chr>     <chr> 
#> 1 term(A)   A1, B1
#> 2 term(B)   D1    
#> 3 term(C)   C1, D1
#> 4 term(D15) E1
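
One thing to keep in mind with filter(term %in% query_list) is that query strings with no match simply disappear from the result. If every query string should get a row, a possible variation is to start from the query vector and left-join the transformed table onto it. The sketch below rebuilds the data from the question so that it runs on its own; the extra query string "term(Z)" is a made-up example that is not in the original data.

```r
library(dplyr)
library(tidyr)

database <- data.frame(
  enzyme = c("A1", "B1", "C1", "D1", "E1"),
  term = c("term(A);term(K);term(Y);term(G);term(F);",
           "term(A);term(K);term(Y);term(G);term(F);",
           "term(H);term(K);term(Y);term(C);term(F);",
           "term(H);term(B);term(Y);term(C);term(F);",
           "term(H);term(K);term(D15);term(G);term(F);"),
  stringsAsFactors = FALSE
)

# "term(Z)" is a hypothetical query string that matches no database row
query_list <- c("term(A)", "term(B)", "term(C)", "term(D15)", "term(Z)")

# Same transformation as above: one row per term, enzymes collapsed
transformed_database <- database %>%
  separate_rows(term, sep = ";") %>%
  filter(term != "") %>%
  group_by(term) %>%
  summarise(enzyme = paste0(enzyme, collapse = ", "))

# Start from the query vector so that every query string keeps its row
query_output <- tibble(term = query_list) %>%
  left_join(transformed_database, by = "term")
```

Here the rows come back in the order of query_list, and the unmatched "term(Z)" has NA in the enzyme column instead of being dropped.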

Solution 4:[4]

We can iterate through query_list in base R and use enframe() from tibble to turn the result into a dataframe.

library(tibble)

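# For each query string, collect the enzymes whose term string contains it.
# fixed = TRUE matches literally, so the parentheses are not read as regex.
enframe(sapply(query_list, function(x)
  paste(database$enzyme[grepl(x, database$term, fixed = TRUE)], collapse = ", ")),
  name = "term",
  value = "enzyme")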

# A tibble: 4 × 2
  term      enzyme
  <chr>     <chr> 
1 term(A)   A1, B1
2 term(B)   D1    
3 term(C)   C1, D1
4 term(D15) E1     
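
If no extra packages are wanted at all, the same result can be sketched in base R alone. This is only one possible arrangement, not part of the answer above: it splits the term strings with strsplit(), builds a long table of (enzyme, term) pairs, and collapses with aggregate(). The names term_sets, long and hits are illustrative.

```r
query_list <- c("term(A)", "term(B)", "term(C)", "term(D15)")
database <- data.frame(
  enzyme = c("A1", "B1", "C1", "D1", "E1"),
  term = c("term(A);term(K);term(Y);term(G);term(F);",
           "term(A);term(K);term(Y);term(G);term(F);",
           "term(H);term(K);term(Y);term(C);term(F);",
           "term(H);term(B);term(Y);term(C);term(F);",
           "term(H);term(K);term(D15);term(G);term(F);"),
  stringsAsFactors = FALSE
)

# Split each row's term string into its individual terms
term_sets <- strsplit(database$term, ";", fixed = TRUE)

# Long format: one row per (enzyme, term) pair
long <- data.frame(
  enzyme = rep(database$enzyme, lengths(term_sets)),
  term = unlist(term_sets),
  stringsAsFactors = FALSE
)

# Keep only the queried terms (exact match), then collapse enzymes per term
hits <- long[long$term %in% query_list, ]
query_output <- aggregate(enzyme ~ term, data = hits, FUN = toString)
```

Unlike the grepl() approach, %in% compares whole terms, so a query string cannot match by being a substring of a longer term.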

Solution 5:[5]

My solution is:

library(dplyr)
library(tidyr)
library(stringr)

database %>% 
  separate_rows(term, sep = ";") %>% 
  filter(term != "") %>%
  filter(term %in% query_list) %>% 
  group_by(term) %>% 
  summarise(enzyme = str_c(enzyme, collapse = ", ")) %>% 
  ungroup()

Which results in

# A tibble: 4 × 2
  term      enzyme
  <chr>     <chr> 
1 term(A)   A1, B1
2 term(B)   D1    
3 term(C)   C1, D1
4 term(D15) E1

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1   RobertoT
Solution 2   Chris Ruehlemann
Solution 3   Aron
Solution 4   benson23
Solution 5   Mossa