Extract just the part of a string that matches a regex pattern in R

I built a data frame scraped automatically from a web page, in which one of the variables is a date in text form, such as "May 12".

However, some observations come with extra characters (in some cases weird ones) attached after the date, for example: "May 20 õ", "Dez 1", "Oct 12ABCdáé". In those cases, I want to keep only the valid part of the value, for example "May 20" and "Oct 12".

After googling for a solution several times and trying functions such as sub, gsub and grep, I could not get any of them to work.

I see that regular expressions have a steep learning curve, but using the tool http://regexr.com/ I was able to define a regular expression that matches the observations where the problem appears: ([A-Z]{1}[a-z]{2})\s\d+.*

At this moment, I have the following example:

vector = c("May 20", "Dez 1", "Oct 12ABCdáé")

And the last solution I tried is:

dateformat = gsub(pattern = "([A-Z]{1}[a-z]{2})\\s\\d+.*", replacement = "([A-Z]{1}[a-z]{2})\\s\\d+", x = vector)

But of course this just replaces each element with the literal text of the replacement string, "([A-Z]{1}[a-z]{2})sd+" (with the backslashes dropped):

> dateformat
[1] "([A-Z]{1}[a-z]{2})sd+" "([A-Z]{1}[a-z]{2})sd+"
[3] "([A-Z]{1}[a-z]{2})sd+"

I really do not understand what I have to put in the replacement argument to remove the bad characters when they exist.



Solution 1:[1]

I added a capture group and a back-reference "\\1":

sub("^([A-Z]{1}[a-z]{2}\\s\\d+).*", "\\1", vector)
[1] "May 20" "Dez 1"  "Oct 12"

The replacement argument accepts back-references like "\\1", but not regex patterns like the one you used; it is otherwise treated as literal text. A back-reference refers back to a capture group defined in the pattern. In this case the capture group is the abbreviated month and the day, which we enclosed in parentheses (...). Any text matched inside those parentheses is returned when "\\1" is placed in the replacement argument. The trailing .* sits outside the group, so it is matched and then discarded.
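To make the numbering of back-references concrete, here is a minimal sketch using the vector from the question. The second call, which splits the pattern into two groups and swaps them, is only an illustration and not something the question asked for.

```r
vector <- c("May 20", "Dez 1", "Oct 12ABCdáé")

# One capture group: keep the month and day, drop everything after them.
cleaned <- sub("^([A-Z][a-z]{2}\\s\\d+).*", "\\1", vector)
print(cleaned)
#> [1] "May 20" "Dez 1"  "Oct 12"

# Two capture groups: "\\1" is the month and "\\2" is the day,
# so the replacement can reorder them (illustration only).
swapped <- sub("^([A-Z][a-z]{2})\\s(\\d+).*", "\\2 \\1", vector)
print(swapped)
#> [1] "20 May" "1 Dez"  "12 Oct"
```

Groups are numbered by the position of their opening parenthesis, from left to right.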

This quick-start guide may help
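If you would rather extract the match than replace around it, base R's regexpr() and regmatches() can do that. This is an alternative sketch, not part of the answer above; the extra element "no date here" and the names m, matched and dates are made up for illustration.

```r
vector <- c("May 20", "Dez 1", "Oct 12ABCdáé", "no date here")

# regexpr() gives the start of the first match in each element, or -1 if none.
m <- regexpr("^[A-Z][a-z]{2}\\s\\d+", vector)

# regmatches() returns only the matched text and drops non-matching elements.
matched <- regmatches(vector, m)
print(matched)
#> [1] "May 20" "Dez 1"  "Oct 12"

# To keep the result aligned with the input, fill non-matches with NA.
dates <- rep(NA_character_, length(vector))
dates[m != -1] <- matched
print(dates)
#> [1] "May 20" "Dez 1"  "Oct 12" NA
```

The NA-filled version is the safer one to assign back into a data frame column, because its length always equals the number of rows.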

Solution 2:[2]

We could also strip any trailing run of non-digit characters, together with any whitespace before it:

sub("\\s*[^0-9]+$", "", vector)
#[1] "May 20" "Dez 1"  "Oct 12"
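The two patterns agree on the data in the question, but they are not equivalent in general. The sketch below compares them on "May 20 õ" (from the question text) and on "Oct 12ABC3", a hypothetical value I made up in which the garbage itself ends in a digit.

```r
messy <- c("May 20 õ", "Oct 12ABCdáé", "Oct 12ABC3")

# Capture-group approach: keeps month + day and drops whatever follows.
by_capture <- sub("^([A-Z]{1}[a-z]{2}\\s\\d+).*", "\\1", messy)
print(by_capture)
#> [1] "May 20" "Oct 12" "Oct 12"

# Trailing-strip approach: removes a run of non-digits only at the very end,
# so garbage that ends in a digit is left untouched.
by_strip <- sub("\\s*[^0-9]+$", "", messy)
print(by_strip)
#> [1] "May 20"     "Oct 12"     "Oct 12ABC3"
```

Which one fits better depends on what kind of noise the scraped values can contain.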

Solution 3:[3]

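In case anyone else is interested in the performance of these different approaches, here is a reproducible example comparing Pierre's approach (Solution 1) with akrun's approach (Solution 2).

It shows akrun's approach is only marginally faster on this input (a median of about 168 microseconds versus 173 microseconds for 200 strings):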

library(microbenchmark)
set.seed(1234)

# Original poster's data
# vector <- c("May 20", "Dez 1", "Oct 12ABCdáé")

# Increased the size to 200 
vector <- sample(c("May 20", "Dez 1", "Oct 12ABCdáé"), 200L, replace = TRUE)

# Comparison of timings with 10000 repetitions
microbenchmark(
  pierre_l = sub("^([A-Z]{1}[a-z]{2}\\s\\d+).*", "\\1", vector),
  akrun = sub("\\s*[^0-9]+$", "", vector),
  times = 10000L
)
#> Unit: microseconds
#>      expr     min      lq     mean  median       uq     max neval
#>  pierre_l 164.201 169.201 233.5096 173.302 220.2515 17809.1 10000
#>     akrun 159.001 164.202 228.9020 168.200 212.7010 13443.5 10000

Created on 2022-03-24 by the reprex package (v2.0.1)
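Before reading too much into the timings, it is worth checking that both calls return the same result on the benchmark input. A minimal sketch, assuming the same seed and sample as above (the names res_pierre, res_akrun and same_result are mine):

```r
set.seed(1234)
vector <- sample(c("May 20", "Dez 1", "Oct 12ABCdáé"), 200L, replace = TRUE)

# Run each approach once on the benchmark input.
res_pierre <- sub("^([A-Z]{1}[a-z]{2}\\s\\d+).*", "\\1", vector)
res_akrun  <- sub("\\s*[^0-9]+$", "", vector)

# Both should yield exactly the same cleaned vector for these three values.
same_result <- identical(res_pierre, res_akrun)
print(same_result)
#> [1] TRUE
```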

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution     Source
Solution 1
Solution 2   akrun
Solution 3