Splitting a comma- and semicolon-delimited string in R
I'm trying to split a string containing two entries, where each entry has a specific format:
- Category (e.g. active site/region), which is followed by a colon (:)
- Term (e.g. His, Glu/nucleotide-binding motif A), which is followed by a comma (,)
Here's the string that I want to split:
string <- "active site: His, Glu,region: nucleotide-binding motif A,"
This is what I have tried so far. Except for the two empty substrings, it produces the desired output.
unlist(str_extract_all(string, ".*?(?=,(?:\\w+|$))"))
[1] "active site: His, Glu" "" "region: nucleotide-binding motif A"
[4] ""
How do I get rid of the empty substrings?
Solution 1:[1]
You get the empty strings because .*? can also match an empty string at any position where the assertion (?=,(?:\\w+|$)) is true.
You can avoid this by requiring 1+ chars other than a colon or comma (a negated character class) before matching the colon:
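To see this concretely, here is a small sketch (assuming the stringr package is installed) that flags which matches are zero-length. Dropping them with nzchar() is a quick workaround, though it leaves the pattern itself unchanged. The name kept is only illustrative.

```r
library(stringr)

string <- "active site: His, Glu,region: nucleotide-binding motif A,"

# Original pattern from the question: the lazy .*? may match zero characters
res <- unlist(str_extract_all(string, ".*?(?=,(?:\\w+|$))"))

# TRUE for the real matches, FALSE for the zero-length ones
nzchar(res)

# Workaround: keep only the non-empty matches
kept <- res[nzchar(res)]
kept
```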
[^:,\n]+:.*?(?=,(?:\w|$))
Explanation
- [^:,\n]+ Match 1+ chars other than a colon, a comma, or a newline
- : Match the colon
- .*? Match any char, as few as possible
- (?= Positive lookahead, assert that what is directly to the right of the current position is:
  - , Match a comma literally
  - (?:\w|$) Match either a single word char, or assert the end of the string
- ) Close the lookahead
string <- "active site: His, Glu,region: nucleotide-binding motif A,"
unlist(str_extract_all(string, "[^:,\\n]+:.*?(?=,(?:\\w|$))"))
Output
[1] "active site: His, Glu" "region: nucleotide-binding motif A"
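If stringr is not available, the same pattern should also work in base R. This is a sketch rather than part of the original answer: it assumes perl = TRUE so that the lookahead is handled by the PCRE engine, and the names pattern and entries are illustrative.

```r
string <- "active site: His, Glu,region: nucleotide-binding motif A,"
pattern <- "[^:,\\n]+:.*?(?=,(?:\\w|$))"

# gregexpr() locates every match; perl = TRUE enables lookahead support
matches <- gregexpr(pattern, string, perl = TRUE)

# regmatches() extracts the matched substrings (one list element per input string)
entries <- regmatches(string, matches)[[1]]
entries
```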
Solution 2:[2]
Much longer and not as elegant as the answer by @The fourth bird (+1), but it works:
library(stringr)
string2 <- strsplit(string, "([^,]+,[^,]+),", perl = TRUE)[[1]][2]
string1 <- str_replace(string, string2, "")
string <- str_replace_all(c(string1, string2), '\\,$', '')
> string
[1] "active site: His, Glu"
[2] "region: nucleotide-binding motif A"
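A more compact split-based variant is sketched below. It is not from either answer: it assumes that an entry always ends at a comma that is directly followed by a word character or by the end of the string, which is the same condition used in the lookahead of Solution 1. The name parts is illustrative.

```r
string <- "active site: His, Glu,region: nucleotide-binding motif A,"

# Split only on commas followed by a word character or the end of the string;
# the comma inside "His, Glu" is followed by a space, so it is left alone.
# strsplit() does not return a trailing empty string for the final comma.
parts <- strsplit(string, ",(?=\\w|$)", perl = TRUE)[[1]]
parts
```

This relies on terms never containing a comma that is immediately followed by a word character; if they can, the extraction pattern from Solution 1 is the safer choice.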
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | TarJae |
