Creating a list of 6-character strings that have a distance/difference of at least 3 per string? (for a DNA/oligo-related problem)

I want to create a list of strings composed only of A, G, C and T in which every pair of strings has a difference (distance) of at least 3. For example, two such strings (oligos) could be ATCTGA and TAGTGC. I can create all the possible combinations of a 6-nt string, but I can't work out how to select a subset of mutually 3-distant oligos. I know that there will be more than one such list, but any list would do.

I haven't done much DNA data manipulation, so I am unsure how to approach this and would appreciate suggestions for any tools out there.

Thanks

r


Solution 1:[1]

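Given a reference oligo x of nonzero length, this function returns a character vector listing all oligos of equal length whose Hamming distance from x (the number of positions at which the two strings differ) is at least mindist. It relies on permutations() from the gtools package, which must be installed.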

oligo1 <- function(x, mindist = 0L) {
    acgt <- c("A", "C", "G", "T")
    x <- match(strsplit(x, "")[[1L]], acgt)
    if ((n <- length(x)) == 0L || anyNA(x)) {
        stop("'x' is not a valid oligo.")
    }
    if (mindist > n) {
        return(character(0L))
    }
    P <- gtools::permutations(4L, n, repeats.allowed = TRUE)
    if (mindist > 0L) {
        P <- P[rowSums(P != rep.int(x, rep.int(4^n, n))) >= mindist, , drop = FALSE]
    }
    m <- nrow(P)
    do.call(paste0, split(acgt[P], gl(n, m)))
}
oligo1("AA", 0L)
##  [1] "AA" "AC" "AG" "AT" "CA" "CC" "CG" "CT" "GA" "GC"
## [11] "GG" "GT" "TA" "TC" "TG" "TT"

oligo1("AA", 1L)
##  [1] "AC" "AG" "AT" "CA" "CC" "CG" "CT" "GA" "GC" "GG"
## [11] "GT" "TA" "TC" "TG" "TT"

oligo1("AA", 2L)
##  [1] "CC" "CG" "CT" "GC" "GG" "GT" "TC" "TG" "TT"
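
To spot-check individual pairs, a small helper can compute the Hamming distance directly. This is only a sketch: the name hamming is hypothetical and not part of the answer above, and it assumes both inputs are single strings of equal length.

```r
# Hypothetical helper (not part of the original answer): Hamming distance
# between two single strings of equal length.
hamming <- function(a, b) {
    a <- strsplit(a, "")[[1L]]
    b <- strsplit(b, "")[[1L]]
    stopifnot(length(a) == length(b))
    # Count the positions at which the two strings differ.
    sum(a != b)
}
hamming("ATCTGA", "TAGTGC")
## [1] 4
```

The two oligos from the question differ at positions 1, 2, 3 and 6, so they are in fact 4-distant, which satisfies the "at least 3" requirement.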

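Applying the above repeatedly, you can build a set containing x whose elements mutually satisfy the condition on Hamming distance. More precisely, the function below greedily constructs a vector y such that x %in% y and the Hamming distance from y[i] to y[j] is at least mindist for all i != j. The result is maximal, in the sense that no further oligo can be added without violating the condition, but a greedy search does not guarantee that it is the largest such set.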

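oligo2 <- function(x, mindist = 0L) {
    # Candidates: x first, then every oligo far enough from x.
    # setdiff() keeps x from being listed twice when mindist = 0.
    y <- c(x, setdiff(oligo1(x, mindist), x))
    n <- length(y)
    pos <- 2L
    while (pos < n) {
        # Keep y[1:pos]; of the rest, keep only those far enough from y[pos].
        y <- c(y[1:pos], intersect(y[(pos+1L):n], oligo1(y[pos], mindist)))
        n <- length(y)
        pos <- pos + 1L
    }
    y
}
oligo2("AA", 0L)
##  [1] "AA" "AC" "AG" "AT" "CA" "CC" "CG" "CT" "GA" "GC"
## [11] "GG" "GT" "TA" "TC" "TG" "TT"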

oligo2("AA", 1L)
##  [1] "AA" "AC" "AG" "AT" "CA" "CC" "CG" "CT" "GA" "GC"
## [11] "GG" "GT" "TA" "TC" "TG" "TT"

oligo2("AA", 2L)
##  [1] "AA" "CC" "GG" "TT"
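
If you would rather avoid the gtools dependency, the same greedy idea can be sketched in base R. The function name greedy_oligos and its arguments are hypothetical. It always starts from the all-"A" oligo and walks through the 4^n candidates in lexicographic order, keeping each one that is far enough from everything kept so far.

```r
# Hypothetical base-R variant (not the original answer's code): one greedy
# pass over all 4^n oligos in lexicographic order.
greedy_oligos <- function(n, mindist) {
    acgt <- c("A", "C", "G", "T")
    # expand.grid() varies its first column fastest, so reverse the columns
    # to get the rows in lexicographic order.
    P <- as.matrix(expand.grid(rep(list(1:4), n)))[, n:1, drop = FALSE]
    keep <- 1L  # always keep the first oligo (all "A")
    for (i in seq_len(nrow(P))[-1L]) {
        K <- P[keep, , drop = FALSE]
        # Hamming distance from candidate i to every oligo kept so far.
        d <- rowSums(K != rep(P[i, ], each = nrow(K)))
        if (all(d >= mindist)) keep <- c(keep, i)
    }
    # Translate the kept integer rows back to strings.
    unname(apply(P[keep, , drop = FALSE], 1L,
                 function(r) paste(acgt[r], collapse = "")))
}
greedy_oligos(2L, 2L)
## [1] "AA" "CC" "GG" "TT"
```

Because it visits the candidates in the same order, greedy_oligos(6L, 3L) should return the same 64 oligos as oligo2("AAAAAA", 3L).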

Hence one possible answer to your question would be:

oligo2("AAAAAA", 3L)
##  [1] "AAAAAA" "AAACCC" "AAAGGG" "AAATTT" "AACACG"
##  [6] "AACCAT" "AACGTA" "AACTGC" "AAGAGT" "AAGCTG"
## [11] "AAGGAC" "AAGTCA" "AATATC" "AATCGA" "AATGCT"
## [16] "AATTAG" "ACAACT" "ACACAG" "ACAGTC" "ACATGA"
## [21] "ACCAAC" "ACCCCA" "ACCGGT" "ACCTTG" "ACGATA"
## [26] "ACGCGC" "ACGGCG" "ACGTAT" "ACTAGG" "ACTCTT"
## [31] "ACTGAA" "ACTTCC" "AGAAGC" "AGACTA" "AGAGAT"
## [36] "AGATCG" "AGCATT" "AGCCGG" "AGCGCC" "AGCTAA"
## [41] "AGGAAG" "AGGCCT" "AGGGGA" "AGGTTC" "AGTACA"
## [46] "AGTCAC" "AGTGTG" "AGTTGT" "ATAATG" "ATACGT"
## [51] "ATAGCA" "ATATAC" "ATCAGA" "ATCCTC" "ATCGAG"
## [56] "ATCTCT" "ATGACC" "ATGCAA" "ATGGTT" "ATGTGG"
## [61] "ATTAAT" "ATTCCG" "ATTGGC" "ATTTTA"

The length-6 oligos in this list are mutually at least 3-distant.
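
If you want to confirm that claim for any candidate list, one option is to compute the smallest pairwise distance. The helper below is a sketch with a hypothetical name, min_pairwise_dist. It assumes a character vector of at least two strings, all of the same length.

```r
# Hypothetical checker (not part of the original answer): smallest Hamming
# distance over all pairs of strings in 'y'.
min_pairwise_dist <- function(y) {
    stopifnot(length(y) >= 2L, length(unique(nchar(y))) == 1L)
    # One row per oligo, one column per position.
    M <- do.call(rbind, strsplit(y, ""))
    # All unordered pairs of row indices, one pair per column.
    ij <- combn(nrow(M), 2L)
    # Count mismatches per pair and take the minimum.
    min(rowSums(M[ij[1L, ], , drop = FALSE] != M[ij[2L, ], , drop = FALSE]))
}
min_pairwise_dist(c("AAAAAA", "AAACCC", "AAAGGG", "AAATTT", "AACACG"))
## [1] 3
```

A list satisfies the requirement when min_pairwise_dist(y) >= 3, so the same check can be run on the full output of oligo2("AAAAAA", 3L).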

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1