In R, check whether phrases occur in a free-text field

I have a list of phrases like the ones below (in reality, there are many more).

phrase_1 <- c("covid19","coronavirus","ill","illness","pandemic")
phrase_2 <- c("e-mail","email","sent","attachment","recipient","data","signature","disclose")
phrase_3 <- c("it","issue","problem","server","network")

I also have a column in a data frame (df) that holds a lot of free text. Let's call this column "Comments" (df$Comments).

The overall objective is to check whether each phrase occurs in that column.

For my purposes, a phrase is said to be present in a text when at least 3 of its elements are present. For example, if a text contains the words "covid19", "coronavirus" and "ill", then I say that phrase_1 occurred in that text. If, on the other hand, it contains only "covid19" and "coronavirus", then the phrase does not occur.
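The rule above can be sketched for a single string in base R. This is only an illustration: `phrase_occurs` is a hypothetical helper name, and lower-casing the text and matching whole words only are assumptions that the rule itself does not state.

```r
phrase_1 <- c("covid19", "coronavirus", "ill", "illness", "pandemic")

# Hypothetical helper: TRUE when at least `min_hits` distinct words of
# `phrase` appear in `text` as whole words (case-insensitive, by assumption).
phrase_occurs <- function(text, phrase, min_hits = 3) {
  # Split on anything that is not a letter, digit or hyphen, so that
  # "e-mail" stays a single token; unique() counts each word only once.
  words <- unique(strsplit(tolower(text), "[^a-z0-9-]+")[[1]])
  sum(phrase %in% words) >= min_hits
}

three_words <- phrase_occurs("covid19 coronavirus ill", phrase_1)
two_words   <- phrase_occurs("covid19 and coronavirus", phrase_1)
```

Here `three_words` is TRUE and `two_words` is FALSE. Because whole words are compared, "illness" on its own does not also count as a match for "ill".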

My goal is to add 2 columns to df - "Number of phrases" and "Phrases that occurred".

I want "Number of phrases" to hold how many of the phrases occurred in the text, and "Phrases that occurred" to hold the names of those phrases. For example, if a text contains the words "covid19", "coronavirus", "ill", "it", "issue", "problem", then df$'Number of phrases' = 2 and df$'Phrases that occurred' = "Phrase1", "Phrase3".

Below is the code I tried, but I think there must be a more efficient way to achieve this.

phrase_1 <- c("covid19","coronavirus","ill","illness","pandemic")
phrase_2 <- c("e-mail","email","sent","attachment","recipient","data","signature","disclose")
phrase_3 <- c("it","issue","problem","server","network")

phrase_list <- list(phrase_1,phrase_2,phrase_3)

phrase.tally.list <- data.frame()
percentage.table <-data.frame()

for (i in seq_along(phrase_list)) {
  phrase <- phrase_list[[i]]
  for (j in seq_len(nrow(df))) {
    x <- 0
    for (k in seq_along(phrase)) {
      # grepl() matches substrings, so "it" is also found inside "with"
      if (grepl(pattern = phrase[[k]], x = df$Comments[j])) x <- x + 1
    }
    # Phrase count threshold: flag the phrase when at least 3 of its words matched
    phrase.tally.list[j, i] <- if (x > 2) 1 else 0
    names(phrase.tally.list)[i] <- paste("Phrase", i, sep = " ")
  }
}
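As one possible alternative to the nested loops above, the comments can be tokenised once and each phrase checked against the tokens. This is a sketch under assumptions, not a definitive answer: the sample `df`, the list names `Phrase1` to `Phrase3`, the `min_hits` variable, and the choice of case-insensitive whole-word matching are all made up for illustration.

```r
phrase_list <- list(
  Phrase1 = c("covid19", "coronavirus", "ill", "illness", "pandemic"),
  Phrase2 = c("e-mail", "email", "sent", "attachment", "recipient",
              "data", "signature", "disclose"),
  Phrase3 = c("it", "issue", "problem", "server", "network")
)

# Made-up sample data standing in for the real df
df <- data.frame(
  Comments = c("covid19 coronavirus ill and it issue problem",
               "email sent",
               "the server had a network problem"),
  stringsAsFactors = FALSE
)

min_hits <- 3  # threshold from the question

# Tokenise every comment once; unique() counts each word a single time
tokens <- lapply(strsplit(tolower(df$Comments), "[^a-z0-9-]+"), unique)

# Logical matrix: one row per comment, one column per phrase
hits <- do.call(cbind, lapply(phrase_list, function(phrase) {
  vapply(tokens, function(w) sum(phrase %in% w) >= min_hits, logical(1))
}))

df$`Number of phrases` <- rowSums(hits)
df$`Phrases that occurred` <- apply(
  hits, 1, function(r) paste(names(phrase_list)[r], collapse = ", ")
)
```

With this sample data the first comment gets 2 phrases ("Phrase1, Phrase3"), the second gets 0, and the third gets 1 ("Phrase3"). Unlike grep() on the raw text, matching on tokens does not count "it" when it only appears inside another word.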





Solution 1:[1]

library(tidyverse)

phrase_1 <- c("covid19", "coronavirus", "ill", "illness", "pandemic")
phrase_2 <- c("e-mail", "email", "sent", "attachment", "recipient", "data", "signature", "disclose")
phrase_3 <- c("it", "issue", "problem", "server", "network")

# Long lookup table: one row per (phrase name, word) pair
search_corpus <- list(
  phrase_1 = phrase_1,
  phrase_2 = phrase_2,
  phrase_3 = phrase_3
) %>%
  enframe(value = "word") %>%
  unnest(word)


df <- tibble(
  id = seq(3),
  Comments = c("Lorem ipsum covid19",
     "Lorem ipsum it issue problem", "Lorem ipsum")
)

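# One row per word of each comment (split on single spaces),
# keeping only the words that belong to some phrase
df %>%
  separate_rows(Comments, sep = " ") %>%
  inner_join(search_corpus, by = c("Comments" = "word"))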
#> # A tibble: 4 × 3
#>      id Comments name    
#>   <int> <chr>    <chr>   
#> 1     1 covid19  phrase_1
#> 2     2 it       phrase_3
#> 3     2 issue    phrase_3
#> 4     2 problem  phrase_3

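# Number of matched words per comment (id) and phrase (name);
# a word that is repeated in a comment is counted each time it appears
df %>%
  separate_rows(Comments, sep = " ") %>%
  inner_join(search_corpus, by = c("Comments" = "word")) %>%
  count(id, name)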
#> # A tibble: 2 × 3
#>      id name         n
#>   <int> <chr>    <int>
#> 1     1 phrase_1     1
#> 2     2 phrase_3     3
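The counts above stop short of the two columns the question asks for. A possible continuation, offered as a sketch rather than as part of the original answer, applies the threshold of 3 and joins the result back onto `df`. The `min_hits`, `phrase_summary` and `result` names are made up here, the `distinct()` step (so that a repeated word counts once) is an added assumption, and the inputs are rebuilt so the block runs on its own.

```r
library(dplyr)
library(tidyr)
library(tibble)

search_corpus <- list(
  phrase_1 = c("covid19", "coronavirus", "ill", "illness", "pandemic"),
  phrase_3 = c("it", "issue", "problem", "server", "network")
) %>%
  enframe(value = "word") %>%
  unnest(word)

df <- tibble(
  id = seq(3),
  Comments = c("Lorem ipsum covid19",
               "Lorem ipsum it issue problem", "Lorem ipsum")
)

min_hits <- 3  # threshold from the question

phrase_summary <- df %>%
  separate_rows(Comments, sep = " ") %>%
  inner_join(search_corpus, by = c("Comments" = "word")) %>%
  distinct(id, name, Comments) %>%   # count each word once per comment
  count(id, name) %>%
  filter(n >= min_hits) %>%          # keep phrases that reach the threshold
  group_by(id) %>%
  summarise(
    `Number of phrases` = n(),
    `Phrases that occurred` = paste(name, collapse = ", "),
    .groups = "drop"
  )

# Comments without any qualifying phrase get 0 and an empty string
result <- df %>%
  left_join(phrase_summary, by = "id") %>%
  mutate(
    `Number of phrases` = coalesce(`Number of phrases`, 0L),
    `Phrases that occurred` = coalesce(`Phrases that occurred`, "")
  )
```

For the sample data only the comment with id 2 reaches the threshold, so it gets 1 and "phrase_3" while the other two rows get 0 and an empty string.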

Created on 2022-03-10 by the reprex package (v2.0.0)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 danlooo