In R, check whether phrases occur in a free-text field
I have a list of phrases like the ones below (in reality, there are many more).
phrase_1 <- c("covid19","coronavirus","ill","illness","pandemic")
phrase_2 <- c("e-mail","email","sent","attachment","recipient","data","signature","disclose")
phrase_3 <- c("it","issue","problem","server","network")
I also have a column in a data frame (df) that contains a lot of free text. Let's call this column "Comments" (df$Comments).
The overall objective is to check whether the phrases occur in that column.
For my purposes, a phrase is said to be present in a text when at least 3 of its elements are present. For example, if a text contains the words "covid19", "coronavirus", and "ill", then I say that phrase_1 occurred in that text. If, on the other hand, it contains only "covid19" and "coronavirus", then the phrase does not occur.
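The "at least 3 elements" rule can be sketched for a single text as follows. This is only an illustration: `phrase_occurs` and its `min_hits` argument are hypothetical names, and splitting on whitespace with lower-casing is an assumption about how words should be matched (punctuation attached to a word would prevent a match).

```r
# Hypothetical helper (not from the original post): TRUE when at least
# `min_hits` distinct words of `phrase` appear as whole words in `text`.
phrase_occurs <- function(text, phrase, min_hits = 3) {
  # Assumption: words are separated by whitespace and case is ignored
  words <- unique(strsplit(tolower(text), "\\s+")[[1]])
  sum(phrase %in% words) >= min_hits
}

phrase_1 <- c("covid19", "coronavirus", "ill", "illness", "pandemic")

phrase_occurs("covid19 coronavirus ill", phrase_1)  # three elements present
phrase_occurs("covid19 coronavirus", phrase_1)      # only two elements present
```

The first call meets the threshold and the second does not, matching the two cases described above.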
My goal is to add 2 columns to df: "Number of phrases" and "Phrases that occurred".
The column "Number of phrases" should hold how many of the phrases occurred in the text, and "Phrases that occurred" should hold the names of those phrases. For example, if a text contains the words "covid19", "coronavirus", "ill", "it", "issue", "problem", then df$'Number of phrases' = 2 and df$'Phrases that occurred' = "Phrase1", "Phrase3".
Below is the code I tried, but I think there must be a more efficient way to achieve this.
phrase_1 <- c("covid19","coronavirus","ill","illness","pandemic")
phrase_2 <- c("e-mail","email","sent","attachment","recipient","data","signature","disclose")
phrase_3 <- c("it","issue","problem","server","network")
phrase_list <- list(phrase_1,phrase_2,phrase_3)
phrase.tally.list <- data.frame()
percentage.table <-data.frame()
for(i in 1:length(phrase_list)){
  phrase <- phrase_list[[i]]
  for(j in 1:dim(df)[1]){
    x <- 0
    for(k in 1:length(phrase)){
      if(length(grep(pattern = phrase[[k]], x = df$Comments[j]))>0){x=x+1}else{x=x+0}
    }
    if(x>2){x=1}else{x=0} ## Phrase Count Threshold
    phrase.tally.list[j,i]<- x
    names(phrase.tally.list)[i]<-paste("Phrase",i,sep=" ")
  }
}
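One possible way to drop the loops over rows is to rely on `grepl()` being vectorised over the text column. The sketch below is base R only and is not taken from the original post. The toy `df` and the list names `Phrase1` to `Phrase3` are assumptions for illustration. Like the `grep()` call above it matches substrings, so "it" is also found inside "with".

```r
phrase_list <- list(
  Phrase1 = c("covid19", "coronavirus", "ill", "illness", "pandemic"),
  Phrase2 = c("e-mail", "email", "sent", "attachment", "recipient",
              "data", "signature", "disclose"),
  Phrase3 = c("it", "issue", "problem", "server", "network")
)

# Toy data standing in for the real df (assumption)
df <- data.frame(
  Comments = c("covid19 coronavirus ill it issue problem",
               "covid19 coronavirus",
               "email sent with attachment"),
  stringsAsFactors = FALSE
)

# For each phrase, test every word against all comments at once and
# flag the rows where at least 3 words were found.
hits <- vapply(phrase_list, function(phrase) {
  found <- vapply(phrase, grepl, logical(nrow(df)),
                  x = df$Comments, fixed = TRUE)
  rowSums(matrix(found, nrow = nrow(df))) >= 3
}, logical(nrow(df)))

# Force a rows-by-phrases matrix even when df has a single row
hits <- matrix(hits, nrow = nrow(df),
               dimnames = list(NULL, names(phrase_list)))

df$`Number of phrases` <- rowSums(hits)
df$`Phrases that occurred` <- apply(hits, 1, function(row) {
  paste(colnames(hits)[row], collapse = ", ")
})
```

With the toy data, the first comment gets 2 phrases ("Phrase1, Phrase3"), the second gets none, and the third gets 1 ("Phrase2"). If whole-word matching is wanted instead, the comments would need to be tokenised first.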
Solution 1:[1]
library(tidyverse)
phrase_1 <- c("covid19", "coronavirus", "ill", "illness", "pandemic")
phrase_2 <- c("e-mail", "email", "sent", "attachment", "recipient", "data", "signature", "disclose")
phrase_3 <- c("it", "issue", "problem", "server", "network")
search_corpus <- list(
  phrase_1 = phrase_1,
  phrase_2 = phrase_2,
  phrase_3 = phrase_3
) %>%
  enframe(value = "word") %>%
  unnest(word)
df <- tibble(
  id = seq(3),
  Comments = c("Lorem ipsum covid19",
               "Lorem ipsum it issue problem", "Lorem ipsum")
)
df %>%
  separate_rows(Comments, sep = " ") %>%
  inner_join(search_corpus, by = c("Comments" = "word"))
#> # A tibble: 4 × 3
#>      id Comments name
#>   <int> <chr>    <chr>
#> 1     1 covid19  phrase_1
#> 2     2 it       phrase_3
#> 3     2 issue    phrase_3
#> 4     2 problem  phrase_3
df %>%
  separate_rows(Comments, sep = " ") %>%
  inner_join(search_corpus, by = c("Comments" = "word")) %>%
  count(id, name)
#> # A tibble: 2 × 3
#>      id name         n
#>   <int> <chr>    <int>
#> 1     1 phrase_1     1
#> 2     2 phrase_3     3
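The counts above stop short of the two columns asked for in the question. A minimal sketch of one way to get there, assuming the same toy `df` and `search_corpus`, is to keep only phrases with at least 3 matched words and join the summary back. The names `phrase_summary` and `result`, and the `distinct()` step that counts each word once per comment, are additions for illustration and are not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(tibble)

search_corpus <- list(
  phrase_1 = c("covid19", "coronavirus", "ill", "illness", "pandemic"),
  phrase_2 = c("e-mail", "email", "sent", "attachment", "recipient",
               "data", "signature", "disclose"),
  phrase_3 = c("it", "issue", "problem", "server", "network")
) %>%
  enframe(value = "word") %>%
  unnest(word)

df <- tibble(
  id = seq(3),
  Comments = c("Lorem ipsum covid19",
               "Lorem ipsum it issue problem", "Lorem ipsum")
)

# Per comment: which phrases reach the threshold of 3 distinct words
phrase_summary <- df %>%
  separate_rows(Comments, sep = " ") %>%
  distinct(id, Comments) %>%   # assumption: repeated words count once
  inner_join(search_corpus, by = c("Comments" = "word")) %>%
  count(id, name) %>%
  filter(n >= 3) %>%
  group_by(id) %>%
  summarise(
    `Number of phrases` = n_distinct(name),
    `Phrases that occurred` = paste(name, collapse = ", ")
  )

# Join back so comments without any phrase are kept, with 0 and ""
result <- df %>%
  left_join(phrase_summary, by = "id") %>%
  mutate(
    `Number of phrases` = coalesce(`Number of phrases`, 0L),
    `Phrases that occurred` = coalesce(`Phrases that occurred`, "")
  )
```

With this toy data only comment 2 reaches the threshold, so it gets 1 phrase ("phrase_3") and the other two comments get 0.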
Created on 2022-03-10 by the reprex package (v2.0.0)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | danlooo |
