gsub: How to extract words between two words
I know a lot of people have already posted questions related to mine, but I couldn't find the correct solution.
I have a lot of sentences like: "Therapie: I like the elephants so much Indication"
I want to extract all the words between "Therapie:" and "Indication". In the example above, that would be "I like the elephants so much".
When I run my code, I always get only the next three words back. What am I doing wrong?
my_df <- c("Therapie: I like the elephants so much Indication")
These are sentences taken from the documents, and I need all the words between "Therapie:" and "Indikation:".
Examples:
____________________________________________________________________________ _____ Diagnose: Blepharochalasis Therapie: Oberlidplastik und Fettresektion mediales und nasales Pocket Indikation:
____________________________________________________________________________ _____ Diagnose: Mammahypoplasie Therapie: Dual Plane Augmentation bds. über IMF Schnitt Indikation:
exc <- sub(".*?\\bTherapie\\W+(\\w+(?:\\W+\\w+){0,2}).*", "\\1", my_df, perl = TRUE)
Solution 1:[1]
With str_match from stringr. The \\s* on either side of the capture group trims the surrounding whitespace.
str <- "Therapie: I like the elephants so much Indication"
library(stringr)
str_match(str, "Therapie:\\s*(.*?)\\s*Indication")[, 2]
# [1] "I like the elephants so much"
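On the question's own attempt: the group \\w+(?:\\W+\\w+){0,2} matches one word plus at most two more, which would explain why only three words come back. A minimal base R sketch follows, using the first example line from the question with the leading underscores dropped and a lazy capture that runs up to "Indikation:". The variable names are illustrative and not from the original post.

```r
# First example line from the question, without the leading underscores
doc <- "Diagnose: Blepharochalasis Therapie: Oberlidplastik und Fettresektion mediales und nasales Pocket Indikation:"

# Question's pattern: {0,2} caps the capture at one word plus two more
three_words <- sub(".*?\\bTherapie\\W+(\\w+(?:\\W+\\w+){0,2}).*", "\\1", doc, perl = TRUE)
three_words
# [1] "Oberlidplastik und Fettresektion"

# Lazy capture that stops at the closing marker instead of counting words
between <- sub(".*?\\bTherapie:\\s*(.*?)\\s*\\bIndikation:.*", "\\1", doc, perl = TRUE)
between
# [1] "Oberlidplastik und Fettresektion mediales und nasales Pocket"
```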
What about a custom function?
str_between <- function(str, w1, w2){
  stringr::str_match(str, paste0(w1, "\\s*(.*?)\\s*", w2))[, 2]
}
str_between(str, "Therapie:", "Indication")
# [1] "I like the elephants so much"
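One caveat with pasting w1 and w2 straight into the pattern: any regex metacharacters in the markers (a dot, parentheses) are interpreted as regex syntax. A possible base R variant is sketched here, assuming the markers should be matched literally. between_literal is a hypothetical name, and the \\Q...\\E quoting breaks if a marker itself contains \\E.

```r
# Hypothetical helper: extract text between two literal markers (base R only)
between_literal <- function(x, w1, w2) {
  # \\Q...\\E makes PCRE treat the markers as literal text
  pattern <- paste0("\\Q", w1, "\\E\\s*(.*?)\\s*\\Q", w2, "\\E")
  matches <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Element 1 is the full match, element 2 the capture group; NA when nothing matched
  vapply(matches, function(m) if (length(m) >= 2) m[2] else NA_character_, character(1))
}

between_literal("Dose (mg): 5 ml Note.", "(mg):", "Note.")
# [1] "5 ml"
between_literal("no markers here", "(mg):", "Note.")
# [1] NA
```

Unlike sub, which hands back the whole input when nothing matches, this returns NA for such strings, which may be easier to filter on.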
Solution 2:[2]
You can do this with base R's sub, provided the string starts with "Therapie:" and ends with "Indication":
my_df <- c("Therapie: I like the elephants so much Indication")
sub("^Therapie: (.*) Indication$", "\\1", my_df)
#> [1] "I like the elephants so much"
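Note that sub returns its input unchanged when the pattern does not match, so with the ^ and $ anchors a line that starts or ends differently comes back whole. A small sketch of one way to flag such lines with NA instead; the second input is a shortened, made-up variant of the question's examples.

```r
x <- c(
  "Therapie: I like the elephants so much Indication",
  "Diagnose: Blepharochalasis Therapie: Oberlidplastik Indikation:"  # shortened, illustrative
)
pattern <- "^Therapie: (.*) Indication$"

# Keep the capture where the pattern matches, NA elsewhere
res <- ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
res
# [1] "I like the elephants so much" NA
```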
Solution 3:[3]
An option with trimws from base R (the whitespace argument needs R 3.6.0 or later)
trimws(str, whitespace = ".*:\\s+|\\s+Indication.*")
# [1] "I like the elephants so much"
data
str <- "Therapie: I like the elephants so much Indication"
Solution 4:[4]
Another way using strsplit:
str <- "Therapie: I like the elephants so much Indication"
words <- strsplit(str, " ")[[1]]
keep <- !words %in% c("Therapie:", "Indication")
paste0(words[keep], collapse = " ")
# [1] "I like the elephants so much"
Solution 5:[5]
Another option with a match only:
str <- "Therapie: I like the elephants so much Indication"
regmatches(str, regexpr("\\bTherapie:\\h*\\K.*?(?=\\h*\\bIndication\\b)", str, perl=TRUE))
Output
[1] "I like the elephants so much"
The pattern matches:
- `\bTherapie:` A word boundary to prevent matching a partial word, then the word `Therapie` followed by `:`
- `\h*\K` Match optional horizontal whitespace, then clear what has been matched so far
- `.*?` Match as few characters as possible
- `(?=\h*\bIndication\b)` Positive lookahead: assert optional horizontal whitespace and the word `Indication` to the right
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | akrun |
| Solution 4 | AlexB |
| Solution 5 | The fourth bird |
