Remove pattern that occurs outside of words

I am trying to remove the pattern 'SO', and everything after it, from the end of each string in a character vector. The problem with the code below is that the match is case-insensitive, so it removes the whole string instead of only the last occurrence of the pattern. One workaround I considered was to do some manual cleaning: force everything to lower case except the final 'SO', then match case-sensitively.

library(dplyr)
library(stringi)

x <- data.frame(y = c("Solutions are welcomed, please SO # 12345"))

x <- x %>% mutate(y = stri_replace_last_regex(y, "SO.*", "", case_insensitive = TRUE)) # This removes the string entirely - I'm not really sure why.

The desired output is:

'Solutions are welcomed, please'

I have tried variations of regex such as \\b\\SO{2}\\b and \\b\\D{2}*\\b|[[:punct:]]. I believe the answer may lie in setting word boundaries, but I am not sure. The second pattern gets rid of the SO, but I suspect that if the letters 'so' appear in sequence elsewhere, apart from within words, they would be removed as well. I just need the last occurrence of SO and everything after it, including punctuation, to be removed from the string.

Any guidance on this would be much appreciated.



Solution 1:[1]

You can use gsub to remove the pattern you don't want.

gsub("\\sSO.+$", "", x$y)

[1] "Solutions are welcomed, please"
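As a side note on why the attempt in the question returned an empty string: the likely cause is the case-insensitive flag, which lets "SO" match the "So" at the start of "Solutions", after which ".*" consumes the rest. Below is a minimal base R sketch of that behaviour, using sub rather than the stringi call from the question (the variable names are only for illustration):

```r
# Assumption: the same input string as in the question.
y <- "Solutions are welcomed, please SO # 12345"

# Case-insensitive: "SO" also matches the "So" in "Solutions",
# and ".*" then swallows everything that follows.
insensitive <- sub("SO.*", "", y, ignore.case = TRUE)

# Case-sensitive (the default): only the upper-case "SO" near the end matches.
sensitive <- sub("SO.*", "", y)

print(insensitive)
print(sensitive)
```

The case-sensitive version still leaves a trailing space, which is why the pattern above starts with \\s.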

Use [[:upper:]]{2} if you want to generalise to any two consecutive upper case letters.

gsub("\\s[[:upper:]]{2}.+$", "", x$y)

[1] "Solutions are welcomed, please"
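One caveat with the generalised pattern is that [[:upper:]]{2} also matches the first two letters of a longer upper-case word. A possible refinement, sketched here with hypothetical sample strings that are not from the question, is to add a word boundary (\\b) after the two letters:

```r
# Hypothetical sample strings for illustration only.
v <- c("Solutions are welcomed, please SO # 12345",
       "Order shipped, reference PO # 999",
       "No code in this one")

# Whitespace, exactly two upper-case letters as a whole word, then the rest.
# perl = TRUE is used so that \\b is handled by the PCRE engine.
cleaned <- gsub("\\s[[:upper:]]{2}\\b.*$", "", v, perl = TRUE)

print(cleaned)
```

Strings without a two-letter upper-case code are returned unchanged.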

UPDATE: the above code might not give the intended result if the string contains more than one "SO", because it removes everything from the first " SO" onwards.

To demonstrate, I have created another string with multiple "SO". Here, the greedy (.*) captures everything from the start of the string (^) up to the last occurrence of "SO" that is followed by at least one character (SO.+$). This text is stored in the first capture group. We can then use gsub to replace the entire string with the first capture group (\\1), thus getting rid of the last "SO" and everything after it. Note that the space before the last "SO" is kept.

x <- data.frame(y = "Solutions are SO welcomed, SO please SO # 12345")

gsub('^(.*)SO.+$', '\\1', x$y)

[1] "Solutions are SO welcomed, SO please "
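If the trailing space is unwanted, or if "SO" should only count as a standalone word, the pattern can be tightened. This is a sketch rather than the only way to do it: it assumes the marker is always preceded by whitespace, and it uses .* instead of .+ so that an "SO" at the very end of the string is removed as well.

```r
y <- "Solutions are SO welcomed, SO please SO # 12345"

# Greedy (.*) backtracks to the last whitespace + standalone "SO";
# the whitespace sits outside the capture group, so it is dropped too.
last_removed <- sub("^(.*)\\s+SO\\b.*$", "\\1", y, perl = TRUE)
print(last_removed)

# A string without the marker is returned unchanged.
untouched <- sub("^(.*)\\s+SO\\b.*$", "\\1", "No marker here", perl = TRUE)
print(untouched)
```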

Solution 2:[2]

library(dplyr)
library(stringr)

x %>% 
  mutate(y = str_replace_all(y, 'SO.*', ''))

or

library(dplyr)
library(stringr)

x %>% 
  mutate(y = str_replace_all(y, 'SO\\s\\#\\s\\d*', ''))

output:

                                y
1 Solutions are welcomed, please 
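The output above still ends with a space, whereas the desired output in the question does not. One way to drop it, sketched here in base R so it runs without extra packages, is to wrap the replacement in trimws (stringr's str_trim would play the same role inside the pipeline):

```r
x <- data.frame(y = "Solutions are welcomed, please SO # 12345",
                stringsAsFactors = FALSE)

# Remove "SO" and everything after it, then strip surrounding whitespace.
x$y <- trimws(sub("SO.*", "", x$y))

print(x$y)
```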

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 TarJae