Get difference between column strings in R dataframe
I have a basic question about R:
I have a data frame where each column holds the set of nucleotide mutations for one of two samples, 'major' and 'minor':
library(tibble)
major <- c("T2A,C26T,G652A")
minor <- c("T2A,C26T,G652A,C725T")
df <- data.frame(major, minor)
tibble(df)
# A tibble: 1 x 2
  major          minor
  <chr>          <chr>
1 T2A,C26T,G652A T2A,C26T,G652A,C725T
I want to identify the mutations present in 'minor' that aren't in 'major'.
I know that if the 'major' and 'minor' mutations were stored as vectors, I could use setdiff to get this difference. However, the data I received is stored as one long string with the mutations separated by commas, and I don't know how to transform this string column into a vector column in the data frame to get the difference (I tried without success).
Using setdiff directly on the columns:
setdiff(df$minor, df$major)
# I got
[1] "T2A,C26T,G652A,C725T"
The expected result was:
C725T
Could anyone help me?
Best,
Solution 1:[1]
The easiest way to do this is to define major and minor as character vectors, with one mutation per element:
major <- c("T2A", "C26T", "G652A")
and
minor <- c("T2A", "C26T", "G652A", "C725T")
then compare the vectors directly (they differ in length, so they cannot be columns of the same tibble):
setdiff(minor, major)
#> [1] "C725T"
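One detail worth keeping in mind is that setdiff is not symmetric: the first argument is the set you keep elements from. A minimal sketch with the same two vectors, where the names in_minor_only and in_major_only are only illustrative:

```r
major <- c("T2A", "C26T", "G652A")
minor <- c("T2A", "C26T", "G652A", "C725T")

# Elements of minor that do not appear in major
in_minor_only <- setdiff(minor, major)

# Swapping the arguments asks the opposite question;
# here every element of major is also in minor, so the result is empty
in_major_only <- setdiff(major, minor)
```

Here in_minor_only holds "C725T" and in_major_only is character(0).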
If you cannot define major and minor as character vectors up front, you can split the strings with the stringr package:
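library(stringr)
library(tibble)  # provides tibble()
major <- c("T2A,C26T,G652A")
minor <- c("T2A,C26T,G652A,C725T")
# simplify = TRUE returns a character matrix with one row per input string
df <- tibble(
  major = str_split(major, pattern = ",", simplify = TRUE),
  minor = str_split(minor, pattern = ",", simplify = TRUE)
)
# setdiff() flattens each matrix column to a vector before comparing
setdiff(df$minor, df$major)
#> [1] "C725T"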
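If the data frame has more than one row, the comparison probably needs to run row by row. The following is a sketch in base R only, not part of the original answer: the second row of the data and the column name minor_only are invented for illustration.

```r
# Hypothetical two-row data frame; the second row is made up for this example
df <- data.frame(
  major = c("T2A,C26T,G652A", "A10G"),
  minor = c("T2A,C26T,G652A,C725T", "A10G,G55C,T90A"),
  stringsAsFactors = FALSE
)

# strsplit() returns a list with one character vector per row
major_list <- strsplit(df$major, ",", fixed = TRUE)
minor_list <- strsplit(df$minor, ",", fixed = TRUE)

# Map() pairs the rows up, so setdiff() runs once per row
only_minor <- Map(setdiff, minor_list, major_list)

# Collapse each per-row result back into a comma-separated string
df$minor_only <- vapply(only_minor, paste, character(1), collapse = ",")
df$minor_only
```

For this data, minor_only contains "C725T" for the first row and "G55C,T90A" for the second. A row with no extra mutations would get an empty string, because paste() with collapse returns "" for an empty vector.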
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
