Filter rows on the condition that at least two distinct key words must be present
I have a dataframe with speech data, like this:
df <- data.frame(
  id = 1:12,
  partcl = c("yeah yeah yeah absolutely", "well you know it 's", "oh well yeah that's right",
             "yeah I mean well oh", "well erm well Peter will be there", "well yeah well",
             "yes yes yes totally", "yeah yeah yeah yeah", "well well I did n't do it",
             "er well yeah that 's true", "oh hey where 's he gone?", "er"
  ))
and a vector with key words called parts:
parts <- c("yeah", "oh", "no", "well", "mm", "yes", "so", "right", "er", "like")
What I need to do is filter those rows with at least two distinct parts values. What I can do is filter those rows with at least two parts values, regardless of whether they're distinct or the same:
library(dplyr)
library(stringr)  # provides str_count()
df %>%
  filter(
    str_count(partcl, paste0("\\b(", paste0(parts, collapse = "|"), ")\\b")) > 1
  )
id partcl
1 1 yeah yeah yeah absolutely
2 3 oh well yeah that's right
3 4 yeah I mean well oh
4 5 well erm well Peter will be there
5 6 well yeah well
6 7 yes yes yes totally
7 8 yeah yeah yeah yeah
8 9 well well I did n't do it
9 10 er well yeah that 's true
How can I require that the matched parts be distinct, so that the result is this:
id partcl
1 3 oh well yeah that's right
2 4 yeah I mean well oh
3 6 well yeah well
4 10 er well yeah that 's true
Solution 1:[1]
You can iterate over parts with sapply() and check df$partcl for each keyword with grepl(). The paste0("\\b", x, "\\b") part ensures that only whole words are detected; otherwise "so", for example, would also be found in "absolutely". The result is a logical matrix with one row per row of df and one column per keyword, so rowSums() gives the number of distinct keywords in each row. We can add that vector to df and then use dplyr::filter() to keep the desired rows.
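As a quick check of the word-boundary point, here is a minimal base R sketch; the variable names are only illustrative and not part of the original answer:

```r
# One utterance from the example data
x <- "yeah yeah yeah absolutely"

# Without word boundaries, "so" matches inside "absolutely"
plain_match <- grepl("so", x)

# With \\b on both sides, only the standalone word "so" would match
whole_word_match <- grepl("\\bso\\b", x)

plain_match       # TRUE
whole_word_match  # FALSE
```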
library(dplyr)
df$distinct_parts_count <-
  sapply(parts, \(x) grepl(paste0("\\b", x, "\\b"), df$partcl)) |>
  rowSums()

df |>
  filter(distinct_parts_count >= 2)
#> id partcl distinct_parts_count
#> 1 3 oh well yeah that's right 4
#> 2 4 yeah I mean well oh 3
#> 3 6 well yeah well 2
#> 4 10 er well yeah that 's true 3
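For comparison, here is a sketch of a different route in base R that is not part of the answer above: extract every keyword match per row, then count the unique ones. The names pattern, matches, n_distinct_parts and result are illustrative, and perl = TRUE is an assumption chosen to get PCRE handling of \\b. The data are repeated so that the block runs on its own.

```r
df <- data.frame(
  id = 1:12,
  partcl = c("yeah yeah yeah absolutely", "well you know it 's", "oh well yeah that's right",
             "yeah I mean well oh", "well erm well Peter will be there", "well yeah well",
             "yes yes yes totally", "yeah yeah yeah yeah", "well well I did n't do it",
             "er well yeah that 's true", "oh hey where 's he gone?", "er")
)
parts <- c("yeah", "oh", "no", "well", "mm", "yes", "so", "right", "er", "like")

# One alternation pattern that matches any keyword as a whole word
pattern <- paste0("\\b(", paste(parts, collapse = "|"), ")\\b")

# List with one character vector of matched keywords per row
matches <- regmatches(df$partcl, gregexpr(pattern, df$partcl, perl = TRUE))

# Number of distinct keywords found in each row
n_distinct_parts <- vapply(matches, function(m) length(unique(m)), integer(1))

# Keep rows with at least two distinct keywords
result <- df[n_distinct_parts >= 2, ]
result$id  # 3 4 6 10
```

Unlike the sapply() approach, this scans each string once with a single pattern, which may matter when parts is long; for a keyword list of this size either approach should be fine.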
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Till |
