Replace the Nth occurrence of a word (substring) in a string in R, where N is the value of an integer column
I want to find the Nth occurrence of a word in an utterance and put [brackets] around it. I have tried various things; the closest I have come is with gsub, but I can't use {copy-1} as the repetition count in my regex. Any ideas? Can we put a variable in there? Thanks!
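library(tidyverse) # str_count() from stringr, mutate()/group_by() from dplyr, uncount() from tidyr
# creating my df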
utterance <- c("we are not who we think we are", "they know who we are")
df <- data.frame(utterance)
df$occurences = str_count(df$utterance, "we")
df <- df %>% mutate(ID = row_number())
df <- df %>% uncount(occurences) %>% group_by(ID) %>% mutate(copy = row_number())
#this is my gsub
gsub("((?:we){copy-1}.*)we", "\\[we\\]", df$utterance)
This is the result I would like to get:
utterance ID copy
<chr> <int> <int>
1 [we] are not who we think we are 1 1
2 we are not who [we] think we are 1 2
3 we are not who we think [we] are 1 3
4 they know who [we] are 2 1
Solution 1:[1]
How about just this:
library(tidyverse)
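f <- function(s, c, target) {
  # start position of the c-th match of target in s (NA if there are fewer than c matches)
  g <- gregexpr(target, s)[[1]][c]
  # gregexpr() returns -1 when there is no match at all
  if (is.na(g) || g < 0) return(s)
  # nchar(), not length(): length(target) is 1 for any single string
  paste0(str_sub(s, 1, g - 1), "[", target, "]", str_sub(s, g + nchar(target)))
}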
df %>% rowwise() %>% mutate(utterance = f(utterance,copy, "we"))
Output:
utterance ID copy
<chr> <int> <int>
1 [we] are not who we think we are 1 1
2 we are not who [we] think we are 1 2
3 we are not who we think [we] are 1 3
4 they know who [we] are 2 1
Note that this will also find targets that are not whole words. For example, the second occurrence of "we" in "we went where we went yesterday" is the first two letters of "went", not the second occurrence of the word "we". If you want to restrict matching to whole words, you can update the gregexpr() call to this:
g = gregexpr(paste0("\\b",target, "\\b"),s)[[1]][c]
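To make the difference concrete, here is a small base R sketch that lists the match positions with and without the word boundaries. The variable names and the use of perl = TRUE are my own choices for illustration, not part of the answer above.

```r
s <- "we went where we went yesterday"

# Start positions of every "we", including the ones inside "went"
all_hits <- as.integer(gregexpr("we", s)[[1]])

# Start positions of "we" as a whole word only
# (perl = TRUE is an assumption here; it selects PCRE for the \\b boundaries)
word_hits <- as.integer(gregexpr("\\bwe\\b", s, perl = TRUE)[[1]])

print(all_hits)   # 1 4 15 18
print(word_hits)  # 1 15
```

With the plain pattern, the second hit (position 4) falls inside "went"; with the boundaries, the second hit (position 15) is the actual second word "we".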
Solution 2:[2]
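Here is a string-splitting approach. We can split the input string on the target word and then piece it back together, using the replacement as the nth connector.
repn <- function(x, find, repl, n) {
  # split on the whole word; a match at the start of x yields a leading "" piece
  parts <- strsplit(x, paste0("\\b", find, "\\b"))[[1]]
  output <- paste0(
    paste0(parts[1:n], collapse = find),                    # pieces before the nth match
    repl,                                                   # replacement for the nth match
    paste0(parts[(n + 1):length(parts)], collapse = find)   # rejoin with find, not a hard-coded "we"
  )
  return(output)
}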
x <- "we are not who we think we are"
repn(x, "we", "[we]", 1)
repn(x, "we", "[we]", 2)
repn(x, "we", "[we]", 3)
[1] "[we] are not who we think we are"
[1] "we are not who [we] think we are"
[1] "we are not who we think [we] are"
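One caveat with splitting: strsplit() drops a trailing empty piece, so a match at the very end of the string (or an n larger than the number of matches) may not come out as expected. As a hedged alternative, here is a sketch that swaps in a different technique, gregexpr() plus the replacement form of regmatches(), instead of splitting. The function name replace_nth is hypothetical, find is treated as a regular expression, and perl = TRUE is my own choice.

```r
replace_nth <- function(x, find, repl, n) {
  # Locate every whole-word match of find in the single string x
  m <- gregexpr(paste0("\\b", find, "\\b"), x, perl = TRUE)
  hits <- regmatches(x, m)[[1]]
  # Fewer than n matches (or none at all): return the input unchanged
  if (n > length(hits)) return(x)
  # Swap only the nth match and write the matches back into the string
  hits[n] <- repl
  regmatches(x, m) <- list(hits)
  x
}

# Apply it pairwise to strings and their occurrence numbers
utterance <- c("we are not who we think we are", "who are we")
n <- c(3, 1)
res <- mapply(replace_nth, utterance, n = n,
              MoreArgs = list(find = "we", repl = "[we]"),
              USE.NAMES = FALSE)
print(res)
```

The same mapply() call should work with data frame columns in place of the two vectors, for example df$utterance and df$copy.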
Solution 3:[3]
Here's a mixed approach using a number of additional packages:
library(data.table)
library(tibble)
library(dplyr)
library(tidyr)
df %>%
  rowid_to_column() %>%
  separate_rows(utterance, sep = " ") %>%
  group_by(rowid) %>%
  mutate(wordcount = ifelse(utterance == "we", rleid(rowid), NA), # simpler: wordcount = ifelse(utterance == "we", 1, NA)
         wordcount = cumsum(!is.na(wordcount))) %>%
  mutate(utterance = ifelse(utterance == "we" & wordcount == copy, paste0("[", utterance, "]"), utterance)) %>%
  summarise(utterance = paste0(utterance, collapse = " ")) %>%
  bind_cols(., df[, 2:3])
# A tibble: 4 × 4
rowid utterance ID copy
<int> <chr> <int> <int>
1 1 [we] are not who we think we are 1 1
2 2 we are not who [we] think we are 1 2
3 3 we are not who we think [we] are 1 3
4 4 they know who [we] are 2 1
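The core idea of this pipeline (split into words, keep a running count of the target word, bracket the word whose count equals copy) can also be sketched for a single string in base R. The function name bracket_nth_word is hypothetical, and the sketch assumes words are separated by single spaces with no punctuation attached to the target.

```r
bracket_nth_word <- function(s, target, n) {
  # Split the utterance into words on single spaces
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  # Running count of how many times the target word has been seen so far
  is_target <- words == target
  running <- cumsum(is_target)
  # Bracket only the word that is the nth occurrence of the target
  hit <- is_target & running == n
  words[hit] <- paste0("[", words[hit], "]")
  paste(words, collapse = " ")
}

out <- bracket_nth_word("we are not who we think we are", "we", 2)
print(out)  # "we are not who [we] think we are"
```

Because the comparison is on whole tokens, "went" is never counted, but a token such as "we," would not be counted either.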
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Tim Biegeleisen |
| Solution 3 | |
