Remove punctuation from text (except the symbol &)
I need to remove punctuation from the text:
data <- "Type the command AT&W enter. in order to save the new protocol on modem;"
gsub('[[:punct:] ]+',' ',data)
This attempt gives the following result:
[1] "Type the command AT W enter in order to save the new protocol on modem "
This is not the desired result because I would like to keep the &, i.e.:
[1] "Type the command AT&W enter in order to save the new protocol on modem "
Solution 1:[1]
What about doing the inverse, i.e. replacing everything that is not a letter, a digit, whitespace or an & with an empty string?
gsub("[^[:alnum:][:space:]&]", "", data)
# [1] "Type the command AT&W enter in order to save the new protocol on modem"
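If more characters need to survive, the same whitelist idea extends by adding them to the negated class. A minimal sketch, assuming a made-up input `x` (not from the question) in which hyphens should be kept too:

```r
# Hypothetical input: keep "&" and "-" but drop all other punctuation
x <- "Use AT&W; see pre-set (v2)!"

# A "-" placed last inside the bracket expression is treated literally
kept <- gsub("[^[:alnum:][:space:]&-]", "", x)
print(kept)
# [1] "Use AT&W see pre-set v2"
```

Anything not listed in the class is removed, so this also strips non-punctuation symbols; that may or may not be what you want.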
Solution 2:[2]
You could try a user-defined regex that matches anything that is not an &, an alphanumeric character or a space:
data <- "Type the command AT&W enter. in order to save the new protocol on modem;"
gsub('[^&[:alnum:] ]+',' ',data)
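Note that this replaces each run of punctuation with a space, so a punctuation mark that is already followed by a space leaves two spaces behind. A small sketch of one way to tidy that up afterwards; the variable names `res` and `cleaned` are only for illustration:

```r
data <- "Type the command AT&W enter. in order to save the new protocol on modem;"

# Punctuation becomes a space; existing spaces are left alone
res <- gsub('[^&[:alnum:] ]+', ' ', data)
print(res)
# [1] "Type the command AT&W enter  in order to save the new protocol on modem "

# Collapse repeated spaces, then trim leading/trailing whitespace
cleaned <- trimws(gsub(" +", " ", res))
print(cleaned)
# [1] "Type the command AT&W enter in order to save the new protocol on modem"
```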
Solution 3:[3]
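Here's another regex, which literally means "find all punctuation characters except &". It works as a double negative: \P{P} is "not punctuation", so [^\P{P}&] matches any character that is punctuation and is not &.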
gsub("[^\\P{P}&]", "", data, perl = TRUE)
[1] "Type the command AT&W enter in order to save the new protocol on modem"
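One caveat, as far as I know: the Unicode property \p{P} covers punctuation only, while symbols such as $ and + belong to a separate category (\p{S}) and are therefore left alone. A short sketch with a made-up string `x` (not from the question) to illustrate the difference from [[:punct:]]:

```r
# Hypothetical input containing symbols as well as punctuation
x <- "cost: $5 + tax & tip!"

# Unicode punctuation only: ":" and "!" go, "$", "+" and "&" stay
unicode_punct <- gsub("[^\\P{P}&]", "", x, perl = TRUE)
print(unicode_punct)
# [1] "cost $5 + tax & tip"

# POSIX class for comparison: "$" and "+" are treated as punctuation too
posix_punct <- gsub("[[:punct:]]", "", x, perl = TRUE)
print(posix_punct)
```

For the sentence in the question both approaches give the same result, since it contains no symbols other than &.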
Solution 4:[4]
Another possible solution, based on stringr:
library(stringr)
str_remove_all(data, "(?!&)[[:punct:]]")
#> [1] "Type the command AT&W enter in order to save the new protocol on modem"
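The pattern relies on a negative lookahead: (?!&) asserts that the next character is not &, and [[:punct:]] then matches a single punctuation character. If you would rather not depend on stringr, the same pattern should also work in base R with perl = TRUE; a minimal sketch (the name `res` is only for illustration):

```r
data <- "Type the command AT&W enter. in order to save the new protocol on modem;"

# perl = TRUE is needed because lookaheads are not supported by the default engine
res <- gsub("(?!&)[[:punct:]]", "", data, perl = TRUE)
print(res)
# [1] "Type the command AT&W enter in order to save the new protocol on modem"
```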
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | MatthewR |
| Solution 3 | benson23 |
| Solution 4 | |
