How to extract the unique letters from runs of consecutive letters in a word?

My question might not be clear, so I'll explain my problem using a simple example.

For example, there is a character string x = "AAATTTGGAA".

What I want to do is split x into runs of consecutive identical letters: "AAA", "TTT", "GG", "AA".

Then the unique letters of each chunk are "A", "T", "G", "A", so the expected output is ATGA.

How can I get this?

I apologize if this is a duplicate, but I could not find anything about this problem.

r


Solution 1:[1]

Here is an approach that uses a regex trick:

x <- "AAATTTGGAA"
out <- strsplit(x, "(?<=(.))(?!\\1)", perl=TRUE)[[1]]
out

[1] "AAA" "TTT" "GG"  "AA"
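The question asks for ATGA rather than the chunks themselves, so one more step is needed. A minimal sketch, assuming the split produces the chunks shown above: take the first letter of each chunk and paste the letters back together. The name result is only illustrative and is not part of the original answer.

```r
x <- "AAATTTGGAA"

# split into runs of identical letters, as in the answer
out <- strsplit(x, "(?<=(.))(?!\\1)", perl = TRUE)[[1]]

# keep the first letter of each run and join them into one string
result <- paste(substr(out, 1, 1), collapse = "")
result
```

For this input, result should be the string ATGA.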

The regex pattern used here says to split at any zero-width boundary where the preceding and following characters are different.

(?<=(.))  lookbehind and also capture preceding character in \1
(?!\\1)   then lookahead and assert that following character is different
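If the intermediate chunks are not needed, the run of repeated letters can be collapsed directly. The sketch below shows two alternatives that are not part of the answer above: gsub() with a backreference, and base R's rle() applied to the individual characters. The variable names collapsed, runs, and collapsed_rle are illustrative.

```r
x <- "AAATTTGGAA"

# replace every run of a repeated character with a single copy of it
collapsed <- gsub("(.)\\1+", "\\1", x, perl = TRUE)

# alternative without a regex: run-length encode the individual characters
runs <- rle(strsplit(x, "")[[1]])
collapsed_rle <- paste(runs$values, collapse = "")

collapsed
collapsed_rle
```

The rle() variant also keeps the length of each run in runs$lengths, which may be useful if the chunk sizes matter as well as the letters.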

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Tim Biegeleisen