How to prevent regmatches from dropping non-matches?

I would like to capture the first match, and return NA if there is no match.

regexpr("a+", c("abc", "def", "cba a", "aa"), perl=TRUE)
# [1]  1 -1  3  1
# attr(,"match.length")
# [1]  1 -1  1  2

x <- c("abc", "def", "cba a", "aa")
m <- regexpr("a+", x, perl=TRUE)
regmatches(x, m)
# [1]  "a"  "a"  "aa"

But I expected "a", NA, "a", "aa".



Solution 1:[1]

Use regexec instead. It returns a list, so each non-match shows up as a character(0) element that you can replace with NA before unlisting.

 R <- regmatches(x, regexec("a+", x))
 R[lengths(R) == 0] <- NA_character_   # replace each empty result with NA
 unlist(R)

 # [1] "a"  NA   "a"  "aa"
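A variation on the same idea, offered as a sketch rather than the answerer's exact code: regexec() puts the full match first and any capture groups after it, so taking element 1 of each list entry keeps the result at one value per input even when the pattern contains groups. The name first_match is made up for this example.

```r
# Hypothetical helper: first match per element, NA where nothing matched.
first_match <- function(x, pattern) {
  hits <- regmatches(x, regexec(pattern, x))
  # Element 1 is the full match; character(0)[1] evaluates to NA_character_.
  vapply(hits, function(h) h[1], character(1))
}

x <- c("abc", "def", "cba a", "aa")
res_plain  <- first_match(x, "a+")
res_groups <- first_match(x, "(a)(b)")  # groups are ignored: still one value per input
print(res_plain)
print(res_groups)
```

Using vapply() with character(1) rather than unlist() means a result of the wrong length or type fails loudly instead of silently misaligning with x.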

Solution 2:[2]

Since R 3.3.0, regmatches can return both the matched and the non-matched substrings when you pass invert=NA. The help file says:

if invert is NA, regmatches extracts both non-matched and matched substrings, always starting and ending with a non-match (empty if the match occurred at the beginning or the end, respectively).

The output is a list. When the match data comes from regexpr (at most one match per string), each element has length 3 if there was a match (text before, the match, text after) or length 1 if there was none (the whole string).

myMatch <- regmatches(x, m, invert=NA)
myMatch
[[1]]
[1] ""   "a"  "bc"

[[2]]
[1] "def"

[[3]]
[1] "cb" "a"  " a"

[[4]]
[1] ""   "aa" ""

So to extract what you want (with "" in place of NA), you can use sapply as follows:

myVec <- sapply(myMatch, function(x) {if(length(x) == 1) "" else x[2]})
myVec
[1] "a"  ""   "a"  "aa"

At this point, if you really want NA instead of "", you can use

is.na(myVec) <- nchar(myVec) == 0L
myVec
[1] "a"  NA   "a"  "aa"

Some revisions:
Note that you can collapse the last two steps into a single call:

myVec <- sapply(myMatch, function(x) {if(length(x) == 1) NA_character_ else x[2]})

A plain NA is logical, so returning it from the function forces a type conversion when the results are combined. Using the character version, NA_character_, avoids this.
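To make the type point concrete, here is a small sketch. It swaps sapply for vapply, which is not what the answer uses, because vapply checks the type of every result; the variable names parts and first are placeholders.

```r
# NA on its own is logical; NA_character_ is the character-typed missing value.
type_plain <- typeof(NA)              # "logical"
type_char  <- typeof(NA_character_)   # "character"

x <- c("abc", "def", "cba a", "aa")
# invert = NA requires R >= 3.3.0.
parts <- regmatches(x, regexpr("a+", x, perl = TRUE), invert = NA)

# vapply() insists that each result is a single character value, so returning
# a plain logical NA here would raise an error rather than being coerced.
first <- vapply(parts,
                function(p) if (length(p) == 1) NA_character_ else p[2],
                character(1))
print(first)
```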

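An even slicker extraction method for the final step is to apply the `[` function directly. Asking for element 2 of a length-1 result returns NA, so non-matches need no special case: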

sapply(myMatch, `[`, 2)
[1] "a"  NA   "a"  "aa"

So you can do the whole thing in a fairly readable single line:

sapply(regmatches(x, m, invert=NA), `[`, 2)
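The one-liner relies on how R indexes past the end of a vector. A minimal illustration, using two hand-written vectors that mimic the list elements shown earlier (no_match and with_match are names chosen for this example):

```r
no_match   <- c("def")             # length 1: the whole string, nothing matched
with_match <- c("cb", "a", " a")   # length 3: before, match, after

second_of_no_match   <- no_match[2]     # position 2 does not exist, so this is NA
second_of_with_match <- with_match[2]   # the matched substring
print(second_of_no_match)
print(second_of_with_match)
```

This shortcut assumes the match data came from regexpr. With gregexpr, which can report several matches per string, the elements can be longer than 3 and position 2 is only the first match.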

Solution 3:[3]

Using more or less the same construction as yours:

chars <- c("abc", "def", "cba a", "aa")

# Keep the elements where regexpr found a match (position > 0),
# then take the first of them.
chars[
   regexpr("a+", chars, perl=TRUE) > 0
][1]
# [1] "abc"

# Nothing matches "q", so the subset is empty and [1] yields NA.
chars[
   regexpr("q", chars, perl=TRUE) > 0
][1]
# [1] NA

Edit - Seems like I misunderstood the question. But since two people have found this useful I shall let it stay.

Solution 4:[4]

You can use stringr::str_extract(string, pattern). It returns NA where there is no match, and its interface is simpler than that of regmatches().
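Applied to the vector from the question, that could look like the sketch below. It assumes the stringr package is installed (it is not part of base R), so the call is guarded with requireNamespace.

```r
x <- c("abc", "def", "cba a", "aa")

# Assumption: stringr is available; skip quietly if it is not.
if (requireNamespace("stringr", quietly = TRUE)) {
  res <- stringr::str_extract(x, "a+")   # first match per element, NA if none
  print(res)
}
```

stringr uses the ICU regular expression engine rather than PCRE, so a pattern written for perl=TRUE may need checking before it is carried over.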

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ricardo Saporta
Solution 2
Solution 3
Solution 4 Martin Gal