Remove parentheses and text within from strings in R
In R, I have a list of companies such as:
companies <- data.frame(Name=c("Company A Inc (COMPA)","Company B (BEELINE)", "Company C Inc. (Coco)", "Company D Inc.", "Company E"))
I want to remove the parentheses and the text inside them, ending up with the following list:
Name
1 Company A Inc
2 Company B
3 Company C Inc.
4 Company D Inc.
5 Company E
One approach I tried was to split the string and then use ldply:
companies$Name <- as.character(companies$Name)
c<-strsplit(companies$Name, "\\(")
ldply(c)
But because not all company names have parentheses portions, it fails:
Error in list_to_dataframe(res, attr(.data, "split_labels"), .id, id_as_factor) :
Results do not have equal lengths
I'm not married to the strsplit solution. Whatever removes that text and the parentheses would be fine.
Solution 1:[1]
A gsub should work here
gsub("\\s*\\([^\\)]+\\)","",as.character(companies$Name))
# or using "raw" strings as of R 4.0
gsub(r"{\s*\([^\)]+\)}","",as.character(companies$Name))
# [1] "Company A Inc" "Company B" "Company C Inc."
# [4] "Company D Inc." "Company E"
Here we just replace every occurrence of "(...)" with nothing, also removing any whitespace in front of it. R makes it look worse than it is because of all the escaping: parentheses are special characters in regular expressions, and each backslash has to be doubled inside an ordinary R string.
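As a minimal sketch of how this could be applied back to the data frame column, the same pattern can be wrapped in a small helper. The name `strip_parens` is hypothetical and used only for illustration.

```r
# Hypothetical helper (not part of base R): drops "(...)" groups
# together with any whitespace directly in front of them.
strip_parens <- function(x) {
  gsub("\\s*\\([^\\)]+\\)", "", as.character(x))
}

companies <- data.frame(
  Name = c("Company A Inc (COMPA)", "Company B (BEELINE)",
           "Company C Inc. (Coco)", "Company D Inc.", "Company E")
)

# as.character() inside the helper means a factor column also works.
companies$Name <- strip_parens(companies$Name)
companies$Name
```

Names without parentheses pass through unchanged because `gsub` only rewrites the parts that match.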
Solution 2:[2]
You could use stringr::str_replace. It's nice because it accepts factor variables.
companies <- data.frame(Name=c("Company A Inc (COMPA)","Company B (BEELINE)",
"Company C Inc. (Coco)", "Company D Inc.",
"Company E"))
library(stringr)
str_replace(companies$Name, " \\s*\\([^\\)]+\\)", "")
# [1] "Company A Inc" "Company B" "Company C Inc."
# [4] "Company D Inc." "Company E"
And if you still want to use strsplit, you could do
companies$Name <- as.character(companies$Name)
unlist(strsplit(companies$Name, " \\(.*\\)"))
# [1] "Company A Inc" "Company B" "Company C Inc."
# [4] "Company D Inc." "Company E"
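The `ldply` error in the question comes from `strsplit` returning pieces of unequal length: two for names with parentheses, one for names without. A possible base-R sketch that sidesteps this keeps only the first piece of each split. It assumes the parenthesized part always comes last in the name.

```r
companies <- data.frame(
  Name = c("Company A Inc (COMPA)", "Company B (BEELINE)",
           "Company C Inc. (Coco)", "Company D Inc.", "Company E"),
  stringsAsFactors = FALSE
)

pieces <- strsplit(companies$Name, "\\(")
n_pieces <- lengths(pieces)   # ragged: 2 2 2 1 1

# Keep only the text before the first "(" and trim the trailing space.
first_piece <- vapply(pieces, `[`, character(1), 1)
cleaned <- trimws(first_piece)
cleaned
```

Unlike `unlist(strsplit(...))`, this always returns exactly one element per input row, even if some text follows the closing parenthesis.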
Solution 3:[3]
You could also use:
library(qdap)
companies$Name <- genX(companies$Name, " (", ")")
companies
Name
1 Company A Inc
2 CompanyB
3 Company C Inc.
4 Company D Inc.
5 CompanyE
Solution 4:[4]
If the parentheses are paired and balanced, you can use
gsub("\\s*(\\([^()]*(?:(?1)[^()]*)*\\))", "", x, perl=TRUE)
R demo:
companies <- data.frame(Name=c("Company A Inc (COMPA)","Company B (BEELINE)", "Company C Inc. (Coco)", "Company D Inc.", "Company E"))
gsub("\\s*(\\([^()]*(?:(?1)[^()]*)*\\))", "", companies$Name, perl=TRUE)
Output:
[1] "Company A Inc" "Company B" "Company C Inc." "Company D Inc."
[5] "Company E"
Regex details
\s* - zero or more whitespace characters
(\([^()]*(?:(?1)[^()]*)*\)) - Capturing group 1 (required to recurse the pattern part between parentheses):
  \( - a ( char
  [^()]* - zero or more chars other than ( and )
  (?:(?1)[^()]*)* - zero or more occurrences of the whole Group 1 pattern ((?1) is a regex subroutine recursing the Group 1 pattern) and then zero or more chars other than ( and )
  \) - a ) char.
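To illustrate where the recursion matters, here is a small comparison on a nested input. The sample strings are made up for this sketch and do not come from the question.

```r
# Made-up input: the first name contains nested parentheses.
x <- c("Company A Inc (COMPA (US))", "Company B (BEELINE)", "Company E")

# Recursive pattern: (?1) re-enters group 1, so inner pairs are consumed too.
nested <- gsub("\\s*(\\([^()]*(?:(?1)[^()]*)*\\))", "", x, perl = TRUE)

# Non-recursive pattern for comparison: it stops at the first ")".
simple <- gsub("\\s*\\([^\\)]+\\)", "", x)

nested
simple
```

The recursive version removes the whole nested group, while the non-recursive one leaves a stray `)` behind on the first element.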
Solution 5:[5]
In your case, removing everything from the first ( onward gives the desired result.
sub(" \\(.*", "", companies$Name)
#[1] "Company A Inc" "Company B" "Company C Inc." "Company D Inc." "Company E"
To remove parentheses and the text within them from a string, you can use:
sub("\\(.*)", "", c("ab (cd) ef", "(ij) kl"))
#[1] "ab  ef" " kl"
If there is more than one pair of parentheses:
gsub("\\(.*?)", "", c("ab (cd) ef (gh)", "(ij) kl"))
#[1] "ab  ef " " kl"
( needs to be escaped as \\(, . matches any character, * repeats the previous item zero or more times, and ? after * makes the match non-greedy, so it stops at the first ) instead of running from the first ( to the last ).
As an alternative you can use [^)], which matches any character except ).
sub("\\([^)]*)", "", c("ab (cd) ef", "(ij) kl"))
#[1] "ab  ef" " kl"
gsub("\\([^)]*)", "", c("ab (cd) ef (gh)", "(ij) kl"))
#[1] "ab  ef " " kl"
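These patterns remove only the parentheses and their contents, so stray spaces are left behind, for example a leading or trailing one. One possible cleanup, sketched here as an optional extra step, collapses runs of whitespace and trims the ends.

```r
x <- c("ab (cd) ef (gh)", "(ij) kl")

# Step 1: drop every "(...)" group.
stripped <- gsub("\\([^)]*\\)", "", x)

# Step 2: collapse repeated whitespace to one space, then trim both ends.
cleaned <- trimws(gsub("\\s+", " ", stripped))
cleaned
```

Doing the cleanup as a second pass keeps the first pattern simple and works no matter where in the string the parentheses appeared.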
If there are nested parentheses:
gsub("\\(([^()]|(?R))*\\)", "", c("ab ((cd) ef) gh (ij)", "(ij) kl"), perl=TRUE)
#[1] "ab  gh " " kl"
Here (?R) recurses into the whole pattern. For example, a(?R)?z matches one or more letters a followed by exactly the same number of letters z.
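The recursion idea can be checked on its own with a short sketch. Because the test below is anchored with ^ and $, it recurses into group 1 with (?1) rather than into the whole pattern with (?R), and the recursive call is made optional with ? so that the recursion can end. This anchored variant is an assumption chosen for illustration and is not taken from the answer.

```r
# Matches strings made of n letters "a" followed by exactly n letters "z".
# (?1) re-enters group 1; the trailing ? lets the innermost level stop.
balanced <- "^(a(?1)?z)$"

candidates <- c("az", "aazz", "aaazzz", "aaz", "azz", "")
is_balanced <- grepl(balanced, candidates, perl = TRUE)
is_balanced
```

The parentheses pattern works the same way, with ( and ) in place of a and z and with other characters allowed in between.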
Solution 6:[6]
library(qdap)
bracketX(companies$Name) -> companies$Name
Solution 7:[7]
Another gsub solution: replace the parenthesized term, together with any whitespace before it, with "" (the empty string).
gsub("(\\s*\\(\\w+\\))", "", companies$Name)
[1] "Company A Inc" "Company B" "Company C Inc." "Company D Inc."
[5] "Company E"
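One thing to keep in mind is that `\\w+` only matches letters, digits and underscores, so this pattern skips parentheses whose contents include spaces or punctuation. The second input below is made up to show the difference, and the broader `[^()]*` pattern is offered only as a possible alternative.

```r
# Made-up input: the second name has spaces and punctuation inside the parentheses.
x <- c("Company A Inc (COMPA)", "Company F (F & Co.)")

# \\w+ requires the contents to be word characters only.
word_only <- gsub("(\\s*\\(\\w+\\))", "", x)

# [^()]* accepts anything except parentheses.
any_content <- gsub("\\s*\\([^()]*\\)", "", x)

word_only
any_content
```

For the ticker-style codes in the question both patterns give the same result.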
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Gregor Thomas |
| Solution 3 | |
| Solution 4 | Wiktor Stribiżew |
| Solution 5 | |
| Solution 6 | Thushara Dulam |
| Solution 7 | Eyayaw |
