'Get difference between column strings in R dataframe

I'm with a fundamental question in R:

Considering that I have a data frame, where each column represent the set of nucleotide mutations into two samples 'major' and 'minor'

major <- c("T2A,C26T,G652A")
minor <- c("T2A,C26T,G652A,C725T")

df <- data.frame(major,minor)
tibble(df)

#A tibble: 1 x 2
  major          minor               
  <chr>          <chr>               
1 T2A,C26T,G652A T2A,C26T,G652A,C725T

And I want to identify the mutations present in 'minor' that aren't in 'major'.

I know that if those 'major' and 'minor' mutations were stored vectors, I could use setdiff to get this difference, but, the data that I received is stored as a long string with some mutations separated by comma, and I don't know how transform this column string to a column vector in the data frame to get this difference (I tried without success).

using the setdiff directly in the columns:

setdiff(df$minor, df$major)
# I got
[1] "T2A C26T G652A C725T"

The expected results was:

C725T

Could anyone help me?

Best,



Solution 1:[1]

Easiest way to do this; define major and minor as character vector

major <- c("T2A", "C26T", "G652A")

and

minor <- c("T2A", "C26T", "G652A", "C725T")

then

df <- tibble(major, minor)
setdiff(df$minor, df$major)
#> "C725T"

If not possible to split major and minor as character vector, you can use stringr package to do that job.

library(stringr)

major <- c("T2A,C26T,G652A")
minor <- c("T2A,C26T,G652A,C725T")

df <- tibble(
  major = str_split(major, pattern = ",", simplify = TRUE), 
  minor = str_split(minor, pattern = ",", simplify = TRUE)
)

setdiff(df$minor, df$major)
#> "C725T"

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1