How to detect the range of positions of a specific set of characters in a string

I have the following sequence:

my_seq <- "----?????-----?V?D????-------???IL??A?---"

What I want to do is detect the ranges of positions of the non-dash characters.

----?????-----?V?D????-------???IL??A?---
|   |   |     |      |       |       |  
1   5   9    15     22      30      38

The final output will be a vector of strings:

out <- c("5-9", "15-22", "30-38")

How can I achieve that with R?



Solution 1:[1]

You could do:

my_seq <- "----?????-----?V?D????-------???IL??A?---"

non_dash <- which(strsplit(my_seq, "")[[1]] != '-')
pos      <- non_dash[c(0, diff(non_dash)) != 1 | c(diff(non_dash), 0) != 1]

apply(matrix(pos, ncol = 2, byrow = TRUE), 1, function(x) paste(x, collapse = "-"))
#> [1] "5-9"   "15-22" "30-38"

Created on 2022-02-18 by the reprex package (v2.0.1)
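
The pairing step above appears to rely on every run having at least two characters: an isolated non-dash character contributes only one boundary position, so the start/end pairs can become misaligned. A minimal sketch of a variant that groups positions by run instead, assuming the string contains at least one non-dash character (run_id and ranges are illustrative names, not part of the original answer):

```r
my_seq <- "----?????-----?V?D????-------???IL??A?---"

# Positions of every character that is not a dash
non_dash <- which(strsplit(my_seq, "")[[1]] != "-")

# A new run starts wherever the gap to the previous position is not 1
run_id <- cumsum(c(TRUE, diff(non_dash) != 1))

# Take the first and last position of each run and join them with "-"
ranges <- unname(vapply(split(non_dash, run_id),
                        function(x) paste(x[1], x[length(x)], sep = "-"),
                        character(1)))
ranges
#> [1] "5-9"   "15-22" "30-38"
```

With this grouping, a single-character run such as the A in "--A--" is reported as "3-3".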

Solution 2:[2]

Inspired by @lovalery's great answer, here is a base R solution:

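# Find every maximal run of non-dash characters
g <- gregexpr(pattern = "[^-]+", my_seq)
# Start positions come from the match; ends are start + length - 1
d <- data.frame(start = unlist(g),
                end   = unlist(g) + attr(g[[1]], "match.length") - 1)
paste(d$start, d$end, sep = "-")
# [1] "5-9"   "15-22" "30-38"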
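
The gregexpr approach can be wrapped in a small reusable function. This is only a sketch: the name non_dash_ranges and its pattern argument are hypothetical, and the early return covers the case where gregexpr reports -1 because nothing matched.

```r
# Hypothetical helper (the name is illustrative, not from the answers above)
non_dash_ranges <- function(x, pattern = "[^-]+") {
  m <- gregexpr(pattern, x)[[1]]
  # gregexpr returns -1 when the pattern does not match at all
  if (m[1] == -1L) return(character(0))
  start <- as.integer(m)
  end   <- start + attr(m, "match.length") - 1L
  paste(start, end, sep = "-")
}

non_dash_ranges("----?????-----?V?D????-------???IL??A?---")
#> [1] "5-9"   "15-22" "30-38"
non_dash_ranges("-----")
#> character(0)
```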

Solution 3:[3]

A one-liner in base R with utf8ToInt (45 is the code point of "-"):

apply(matrix(which(diff(c(FALSE, utf8ToInt(my_seq) != 45L, FALSE)) != 0) - 0:1, 2), 2, paste, collapse = "-")
#> [1] "5-9"   "15-22" "30-38"
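
For readers who find the one-liner dense, the same idea can be unpacked step by step. This is a sketch of the logic rather than a replacement; is_char, edges, starts, ends and res are illustrative names.

```r
my_seq <- "----?????-----?V?D????-------???IL??A?---"

# TRUE for every character that is not "-" (code point 45)
is_char <- utf8ToInt(my_seq) != 45L

# Pad with FALSE on both sides so runs touching either end are still detected
edges <- diff(c(FALSE, is_char, FALSE))

starts <- which(edges == 1)        # FALSE -> TRUE: first position of a run
ends   <- which(edges == -1) - 1L  # TRUE -> FALSE happens one past the last position

res <- paste(starts, ends, sep = "-")
res
#> [1] "5-9"   "15-22" "30-38"
```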

Solution 4:[4]

Try gregexec (available from R 4.1.0), which locates the dash-to-? and ?-to-dash boundaries:

paste0(gregexec('-\\?', my_seq)[[1]][1,] + 1, '-',
       gregexec('\\?-', my_seq)[[1]][1,])
#> [1] "5-9"   "15-22" "30-38"
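
The patterns above assume that every run begins and ends with a ? that sits next to a dash. A hedged sketch of a more general variant uses gregexpr with Perl-style lookarounds, so that any non-dash character can open or close a run, including at the very start or end of the string (starts, ends and res are illustrative names):

```r
my_seq <- "----?????-----?V?D????-------???IL??A?---"

# A run starts at a non-dash character that is NOT preceded by a non-dash
starts <- as.integer(gregexpr("(?<![^-])[^-]", my_seq, perl = TRUE)[[1]])

# A run ends at a non-dash character that is NOT followed by a non-dash
ends <- as.integer(gregexpr("[^-](?![^-])", my_seq, perl = TRUE)[[1]])

res <- paste0(starts, "-", ends)
res
#> [1] "5-9"   "15-22" "30-38"
```

Like the original, this sketch does not handle a string with no non-dash characters, where gregexpr returns -1.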

Solution 5:[5]

Here is a rle + tidyverse approach:

library(dplyr)
with(rle(strsplit(my_seq, "")[[1]] != "-"),
     data.frame(lengths, values)) |>
  mutate(end = cumsum(lengths)) |>
  mutate(start =  1 + lag(end, 1,0)) |>
  mutate(rng = paste(start, end, sep = "-")) |>
  filter(values) |>
  pull(rng)

#> [1] "5-9"   "15-22" "30-38"

However, if you don't mind installing S4Vectors (a Bioconductor package), the code can be made really terse:

library(S4Vectors)

r <- Rle(strsplit(my_seq, "")[[1]] != "-")

paste(start(r), end(r), sep = "-")[runValue(r)]

#> [1] "5-9"   "15-22" "30-38"
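
If neither dplyr nor S4Vectors is available, the same run-length idea can be sketched with base R's rle alone: the end of each run is the cumulative sum of the run lengths, and the start follows from the end and the length (r, end_pos, start_pos and res are illustrative names).

```r
my_seq <- "----?????-----?V?D????-------???IL??A?---"

# Run-length encode the "is not a dash" indicator
r <- rle(strsplit(my_seq, "")[[1]] != "-")

end_pos   <- cumsum(r$lengths)           # last position of every run
start_pos <- end_pos - r$lengths + 1L    # first position of every run

# Keep only the runs of non-dash characters (values == TRUE)
res <- paste(start_pos, end_pos, sep = "-")[r$values]
res
#> [1] "5-9"   "15-22" "30-38"
```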

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Allan Cameron
Solution 2
Solution 3
Solution 4
Solution 5 Stefano Barbi