Remove pattern that occurs outside of words
I am trying to remove the pattern 'SO', and everything after it, from the end of a character vector. The problem with the code below is that it matches any sequence of 'SO' regardless of case and removes the whole string, rather than only the last occurrence. One workaround I considered was some manual cleaning: force everything to lower case except the final 'SO', and then match case-sensitively.
library(dplyr)
library(stringi)
x <- data.frame(y = c("Solutions are welcomed, please SO # 12345"))
x <- x %>% mutate(y = stri_replace_last_regex(y, "SO.*", "", case_insensitive = TRUE)) # This removes the string entirely - I'm not really sure why.
The desired output is:
'Solutions are welcomed, please'
I have tried regexes such as \\b\\SO{2}\\b and \\b\\D{2}*\\b|[[:punct:]]. I believe the answer may lie in setting word boundaries, but I am not sure. The second one gets rid of the SO, but I suspect that any other 'SO' letter sequence standing apart from a word elsewhere in the string would be removed as well. I just need the last occurrence of SO and everything after it, including punctuation, to be removed.
Any guidance on this would be much appreciated.
Solution 1:[1]
You can use gsub to remove the pattern you don't want.
gsub("\\sSO.+$", "", x$y)
[1] "Solutions are welcomed, please"
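As for why the attempt in the question emptied the string: the most likely cause is case_insensitive = TRUE. With case ignored, SO.* already matches at the "So" of "Solutions", and .* then consumes the rest, so the only (and therefore last) match is the entire string. A minimal base R sketch of the same effect, assuming sub() with ignore.case behaves analogously to the stringi call here:

```r
y <- "Solutions are welcomed, please SO # 12345"

# Case-insensitive: "So" in "Solutions" matches first, so everything is removed.
emptied <- sub("SO.*", "", y, ignore.case = TRUE)

# Case-sensitive: only the upper-case "SO" matches; the space before it stays.
kept <- sub("SO.*", "", y)

emptied  # ""
kept     # "Solutions are welcomed, please "
```

Keeping the match case-sensitive is what lets "Solutions" survive; the leading \\s in the pattern above additionally drops the space before "SO".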
Use [[:upper:]]{2} if you want to generalise to any two consecutive upper case letters.
gsub("\\s[[:upper:]]{2}.+$", "", x$y)
[1] "Solutions are welcomed, please"
UPDATE: the above code might not be accurate if you have more than one "SO" in the string, because it removes everything from the first " SO" onwards.
To demonstrate, I have created another string with multiple "SO". Here, (.*) captures everything from the start of the string (^) up to the last occurrence of "SO". Because .* is greedy, it extends as far as it can while still leaving one "SO" for SO.+$ to match. That text is stored in the first capture group. We then use gsub to replace the entire string with the first capture group (\\1), which drops the last "SO" and everything after it.
x <- data.frame(y = "Solutions are SO welcomed, SO please SO # 12345")
gsub('^(.*)SO.+$', '\\1', x$y)
[1] "Solutions are SO welcomed, SO please "
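Note that the result above keeps the space that preceded the last "SO", and that the pattern would also match "SO" inside a longer upper-case word. The sketch below wraps the same greedy-capture idea in a small function that adds word boundaries and trims the result. The name remove_from_last and its marker argument are hypothetical, and the marker is assumed to consist of plain letters with no regex metacharacters.

```r
# Hypothetical helper: drop the last whole-word `marker` and everything after it.
remove_from_last <- function(text, marker = "SO") {
  # Greedy (.*) backtracks only as far as the last whole-word marker;
  # perl = TRUE is used so that \\b is handled by the PCRE engine.
  pattern <- paste0("^(.*)\\b", marker, "\\b.*$")
  # Strings without the marker come back unchanged, apart from trimming.
  trimws(sub(pattern, "\\1", text, perl = TRUE))
}

remove_from_last("Solutions are SO welcomed, SO please SO # 12345")
# "Solutions are SO welcomed, SO please"
remove_from_last("Solutions are welcomed, please SO # 12345")
# "Solutions are welcomed, please"
```

Both sub() and trimws() are vectorised, so the function can be applied directly to a data frame column such as x$y.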
Solution 2:[2]
library(dplyr)
library(stringr)
x %>%
mutate(y = str_replace_all(y, 'SO.*', ''))
or
library(dplyr)
library(stringr)
x %>%
mutate(y = str_replace_all(y, 'SO\\s\\#\\s\\d*', ''))
output:
y
1 Solutions are welcomed, please
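One detail the printed output hides: a pattern that starts at "SO" leaves the space before it in place, so the result ends with a trailing space. The sketch below shows the difference using base R's sub() so that it runs without extra packages. The variable names are illustrative, and the same \\s* prefix should carry over to the stringr call above.

```r
y <- "Solutions are welcomed, please SO # 12345"

# Pattern starting at "SO": the space before it is kept.
with_space <- sub("SO.*", "", y)

# Consuming the preceding whitespace as well avoids the trailing space.
without_space <- sub("\\s*SO.*$", "", y)

endsWith(with_space, " ")     # TRUE
endsWith(without_space, " ")  # FALSE
```

Like the patterns above, this removes everything from the first upper-case "SO" onwards, so it suits strings that contain the marker only once.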
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | TarJae |
