How to detect the ranges of positions of a specific set of characters in a string
I have the following sequence:
my_seq <- "----?????-----?V?D????-------???IL??A?---"
What I want to do is detect the ranges of positions of the non-dash characters.
----?????-----?V?D????-------???IL??A?---
|   |   |     |      |       |       |
1   5   9     15     22      30      38
The final output will be a vector of strings:
out <- c("5-9", "15-22", "30-38")
How can I achieve that with R?
Solution 1:[1]
You could do:
my_seq <- "----?????-----?V?D????-------???IL??A?---"
non_dash <- which(strsplit(my_seq, "")[[1]] != '-')
pos <- non_dash[c(0, diff(non_dash)) != 1 | c(diff(non_dash), 0) != 1]
apply(matrix(pos, ncol = 2, byrow = TRUE), 1, function(x) paste(x, collapse = "-"))
#> [1] "5-9" "15-22" "30-38"
Created on 2022-02-18 by the reprex package (v2.0.1)
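Note that the pairing step above assumes every non-dash run is at least two characters long: a single-character run contributes only one position to pos, so the matrix ends up pairing boundaries from different runs. A minimal sketch of a variant that computes the starts and ends separately is shown here; the helper name run_ranges and the variables starts and ends are illustrative and not part of the original answer.

```r
# Hypothetical helper (not from the original answer): start and end positions
# are derived independently, so single-character runs are handled as well.
run_ranges <- function(x) {
  non_dash <- which(strsplit(x, "")[[1]] != "-")
  # A run starts wherever the previous non-dash position is not adjacent
  starts <- non_dash[c(TRUE, diff(non_dash) != 1)]
  # A run ends wherever the next non-dash position is not adjacent
  ends <- non_dash[c(diff(non_dash) != 1, TRUE)]
  paste(starts, ends, sep = "-")
}

my_seq <- "----?????-----?V?D????-------???IL??A?---"
run_ranges(my_seq)
#> [1] "5-9" "15-22" "30-38"
run_ranges("A--B")
#> [1] "1-1" "4-4"
```

For the sequence in the question this gives the same result as the original code; the two only differ when a run has length one.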
Solution 2:[2]
Inspired by @lovalery's great answer, a base R solution is:
# Match every maximal run of non-dash characters
g <- gregexpr(pattern = "[^-]+", my_seq)
# Start positions come from the match; ends are start + length - 1
d <- data.frame(start = unlist(g),
                end = unlist(g) + attr(g[[1]], "match.length") - 1)
paste(d$start, d$end, sep = "-")
# [1] "5-9" "15-22" "30-38"
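The gregexpr() approach can also be wrapped in a small function. The sketch below is only one way to do it: the name non_dash_ranges is hypothetical, and the early return covers the case where the string contains no non-dash characters, in which gregexpr() reports a single -1 instead of a position.

```r
# Hypothetical helper (not from the original answers) around gregexpr().
non_dash_ranges <- function(x) {
  m <- gregexpr("[^-]+", x)[[1]]
  # gregexpr() signals "no match" with a single -1
  if (m[1] == -1L) return(character(0))
  start <- as.integer(m)                       # drops the match attributes
  end <- start + attr(m, "match.length") - 1L  # inclusive end of each run
  paste(start, end, sep = "-")
}

non_dash_ranges("----?????-----?V?D????-------???IL??A?---")
#> [1] "5-9" "15-22" "30-38"
non_dash_ranges("-----")
#> character(0)
```

Without the guard, an all-dash input would produce a meaningless range built from the -1 sentinel.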
Solution 3:[3]
A one-liner in base R with utf8ToInt (45L is the code point of "-"):
apply(matrix(which(diff(c(FALSE, utf8ToInt(my_seq) != 45L, FALSE)) != 0) - 0:1, 2), 2, paste, collapse = "-")
#> [1] "5-9" "15-22" "30-38"
Solution 4:[4]
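Try (this relies on every non-dash run beginning and ending with "?" and being flanked by dashes, as in the example):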
paste0(gregexec('-\\?', my_seq)[[1]][1,] + 1, '-',
gregexec('\\?-', my_seq)[[1]][1,])
#> [1] "5-9" "15-22" "30-38"
Solution 5:[5]
Here is a rle + tidyverse approach:
library(dplyr)
with(rle(strsplit(my_seq, "")[[1]] != "-"),
     data.frame(lengths, values)) |>
  mutate(end = cumsum(lengths)) |>
  mutate(start = 1 + lag(end, 1, 0)) |>
  mutate(rng = paste(start, end, sep = "-")) |>
  filter(values) |>
  pull(rng)
#> [1] "5-9" "15-22" "30-38"
However, if you don't mind installing S4Vectors (a Bioconductor package), the code can be made much terser:
library(S4Vectors)
r <- Rle(strsplit(my_seq, "")[[1]] != "-")
paste(start(r), end(r), sep = "-")[runValue(r)]
#> [1] "5-9" "15-22" "30-38"
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Allan Cameron |
| Solution 2 | |
| Solution 3 | |
| Solution 4 | |
| Solution 5 | Stefano Barbi |
