Invalid regexp in R

I'm trying to use this regexp in R:

\?(?=([^'\\]*(\\.|'([^'\\]*\\.)*[^'\\]*'))*[^']*$)

I'm escaping like so:

\\?(?=([^'\\\\]*(\\\\.|'([^'\\\\]*\\\\.)*[^'\\\\]*'))*[^']*$)

I get an invalid regexp error.

Regexpal has no problem with the regex, and I've checked that the interpreted regex in the R error message is exactly the same as what I'm using in Regexpal, so I'm at a loss. I don't think the escaping is the problem.

Code:

output <- sub("\\?(?=([^'\\\\]*(\\\\.|'([^'\\\\]*\\\\.)*[^'\\\\]*'))*[^']*$)", "!", "This is a test string?")


Solution 1:[1]

By default, R uses POSIX (Portable Operating System Interface) extended regular expressions, implemented via the TRE library (see these SO posts [1,2] and ?regex [caveat emptor: machete-level density ahead]).

Look-ahead ((?=...)), look-behind ((?<=...)) and their negations ((?!...) and (?<!...)) are probably the most salient examples of forms specific to PCRE (Perl-Compatible Regular Expressions), which are not compatible with POSIX. Your pattern contains a look-ahead, which is why the default engine reports an invalid regexp.
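
To see the difference in isolation, here is a minimal sketch with a much smaller look-ahead than the one in the question. The pattern and the variable names are made up for illustration; the only assumption is that the default engine rejects (?=...) the same way it rejects the original pattern.

```r
# Minimal look-ahead: match "a" only when it is followed by "b".
pattern <- "a(?=b)"

# Default engine: compiling the pattern is expected to raise an error,
# which try() captures instead of stopping the script.
default_attempt <- try(suppressWarnings(grepl(pattern, "ab")), silent = TRUE)
default_fails <- inherits(default_attempt, "try-error")

# PCRE engine: same pattern, same input, but with perl = TRUE.
perl_matches <- grepl(pattern, "ab", perl = TRUE)

print(default_fails)
print(perl_matches)
```

If default_fails is TRUE while perl_matches is also TRUE, the regex itself is fine and only the choice of engine is at fault.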

R can be made to understand your regex by setting the perl argument to TRUE; this argument is available in the base regex functions (sub, gsub, grepl, regexpr, gregexpr, etc.):

output <- sub(
  "\\?(?=([^'\\\\]*(\\\\.|'([^'\\\\]*\\\\.)*[^'\\\\]*'))*[^']*$)",
  "!",
  "This is a test string?",
  perl = TRUE
)
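
The same argument works on the other base functions. A short sketch, using hypothetical sample data and a simpler look-ahead (digits followed by " USD") rather than the pattern from the question:

```r
# Hypothetical sample data, not from the original question.
x <- c("price: 10 USD", "price: 25 EUR", "no price here")

# Look-ahead: one or more digits, only when followed by " USD".
pat <- "[0-9]+(?= USD)"

# grepl(): which elements contain a match?
has_usd <- grepl(pat, x, perl = TRUE)

# regexpr() finds the match positions and takes perl = TRUE;
# regmatches() then extracts the text and has no perl argument itself.
m <- regexpr(pat, x, perl = TRUE)
amounts <- regmatches(x, m)

# gsub(): replace every match; the " USD" in the look-ahead is not consumed.
masked <- gsub(pat, "##", x, perl = TRUE)

print(has_usd)
print(amounts)
print(masked)
```

Note that regmatches() is used together with regexpr() or gregexpr(), so the engine is chosen in those calls.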

It looks much less intimidating in R >= 4.0, which supports raw strings, so the backslashes no longer need doubling:

output <- sub(
  R"(\?(?=([^'\\]*(\\.|'([^'\\]*\\.)*[^'\\]*'))*[^']*$))",
  "!",
  "This is a test string?",
  perl = TRUE
)
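
As a quick sanity check (R >= 4.0 only, since it uses a raw string), the sketch below confirms that the escaped and raw spellings above are the same pattern, and tries it on a second, hypothetical input where one question mark sits inside single quotes. The variable names and the second test string are illustrative additions.

```r
# The double-escaped form and the raw-string form of the same regex.
escaped_pattern <- "\\?(?=([^'\\\\]*(\\\\.|'([^'\\\\]*\\\\.)*[^'\\\\]*'))*[^']*$)"
raw_pattern     <- R"(\?(?=([^'\\]*(\\.|'([^'\\]*\\.)*[^'\\]*'))*[^']*$))"

# Both literals should produce the identical character string.
same_pattern <- identical(escaped_pattern, raw_pattern)

# The string from the question: the trailing "?" is replaced.
plain_case <- sub(raw_pattern, "!", "This is a test string?", perl = TRUE)

# Hypothetical input: the "?" inside the quotes is followed by an
# unbalanced quote, so the look-ahead fails there and only the
# final "?" is replaced.
quoted_case <- sub(raw_pattern, "!", "Is 'this?' ok?", perl = TRUE)

print(same_pattern)
print(plain_case)
print(quoted_case)
```

The first call should give "This is a test string!" and the second "Is 'this?' ok!".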

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1