'Using a column as row names and set the other columns as binary values based on strings
I need one of my columns to be set as row name, accounting for duplicates, and the other columns should be separated by unique values into new columns assuming binary values if they are related to the row name.
Example:
ORIGINAL DATAFRAME
df <- data.frame(module = c("M1","M1","M1","M1","M1","M2"),
GO = c("inflama","inflama","ciclo","inflama","ciclo","sinapse"),
gene = c("PPARG","RELA","RELA","IRF5","ACKR1","GATA3"))
> df
module GO gene
1 M1 inflama PPARG
2 M1 inflama RELA
3 M1 ciclo RELA
4 M1 inflama IRF5
5 M1 ciclo ACKR1
6 M2 sinapse GATA3
FINAL DATAFRAME
df2 <- data.frame(gene = c("PPARG","RELA","IRF5","ACKR1","GATA3"),
M1 = c(1,1,1,1,0),
M2 = c(0,0,0,0,1),
inflama = c(1,1,1,0,0),
ciclo = c(0,1,0,1,0), sinapse = c(0,0,0,0,1))
> df2
gene M1 M2 inflama ciclo sinapse
1 PPARG 1 0 1 0 0
2 RELA 1 0 1 1 0
3 IRF5 1 0 1 0 0
4 ACKR1 1 0 0 1 0
5 GATA3 0 1 0 0 1
Thank you in advance for your help!
Solution 1:[1]
The Matrix package has a very efficient fac2sparse function for this purpose. If you downloaded R from CRAN, then you already have it installed. Here is an example of its usage:
library("Matrix")
f <- gl(3, 2L, labels = letters[1:3])
f
## [1] a a b b c c
## Levels: a b c
fac2sparse(f)
## 3 x 6 sparse Matrix of class "dgCMatrix"
##
## a 1 1 . . . .
## b . . 1 1 . .
## c . . . . 1 1
To include the resulting matrix in a long format data frame, you would transpose it with t and coerce it from sparse to dense format with as(Class="matrix").
fac2dense <- function(x) as(t(fac2sparse(x)), "matrix")
data.frame(f, fac2dense(f))
## f a b c
## 1 a 1 0 0
## 2 a 1 0 0
## 3 b 0 1 0
## 4 b 0 1 0
## 5 c 0 0 1
## 6 c 0 0 1
Putting this into practice, I would transform your data frame df in two steps, like so:
df_bin <- data.frame(df["gene"], fac2dense(df[["module"]]), fac2dense(df[["GO"]]))
df_bin
gene M1 M2 ciclo inflama sinapse
1 PPARG 1 0 0 1 0
2 RELA 1 0 0 1 0
3 RELA 1 0 1 0 0
4 IRF5 1 0 0 1 0
5 ACKR1 1 0 1 0 0
6 GATA3 0 1 0 0 1
aggregate(. ~ gene, df_bin, max)
gene M1 M2 ciclo inflama sinapse
1 ACKR1 1 0 1 0 0
2 GATA3 0 1 0 0 1
3 IRF5 1 0 0 1 0
4 PPARG 1 0 0 1 0
5 RELA 1 0 1 1 0
Solution 2:[2]
I think calling pivot_wider twice is good enough here.
library(tidyr)
df %>%
pivot_wider(names_from = module,
values_from = module,
values_fill = 0,
values_fn = \(x)1) %>%
pivot_wider(names_from = GO,
values_from = GO,
values_fill = 0,
values_fn = \(x)1)
# A tibble: 5 × 6
gene M1 M2 inflama ciclo sinapse
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 PPARG 1 0 1 0 0
2 RELA 1 0 1 1 0
3 IRF5 1 0 1 0 0
4 ACKR1 1 0 0 1 0
5 GATA3 0 1 0 0 1
If we want to make it a bit simpler, we can define a pivoting function:
pivoting<-function(x, y) pivot_wider(x,
names_from = y,
values_from = y,
values_fill = 0,
values_fn = \(x)1)
df %>%
pivoting('module') %>%
pivoting('GO')
But I think this is too much trouble as there is already a package for that. As mentioned by @akrun, I would go with dummy_cols
Solution 3:[3]
Using data.table is also an option:
library(data.table)
setDT(df)
Use dcast() to aggregate:
DT1 = dcast(df, gene ~ GO, fill = 0L, fun.agg = \(x) sum(!is.na(x)))
DT2 = dcast(df, gene ~ module, fill = 0L, fun.agg = \(x) ifelse(x == 0, 0, 1))
Then you can join the tables:
DT3 = DT1[DT2, on = "gene"]
DT3
Output
gene ciclo inflama sinapse M1 M2
1: ACKR1 1 0 0 1 0
2: GATA3 0 0 1 0 1
3: IRF5 0 1 0 1 0
4: PPARG 0 1 0 1 0
5: RELA 1 1 0 1 0
Solution 4:[4]
You should check out dplyr, tidyr, and tidyverse libraries for data wrangling in R.
## Loading the required libraries
library(dplyr)
library(tidyr)
library(tidyverse)
## Creating a pivot table
df_pivot = df %>%
pivot_longer(!gene, names_to = 'Attributes', values_to = "AttributeValue") %>% ## Wider to longer
group_by(gene,AttributeValue) %>% ## Grouping
summarise(Count = n()) %>% ## Aggregation
pivot_wider(names_from = AttributeValue , values_from = Count) ## Longer to wider
## Replacing NA with 0
df_pivot[is.na(df_pivot)] = 0
df_pivot
# A tibble: 5 x 6
# Groups: gene [5]
gene ciclo M1 M2 sinapse inflama
<chr> <int> <int> <int> <int> <int>
1 ACKR1 1 1 0 0 0
2 GATA3 0 0 1 1 0
3 IRF5 0 1 0 0 1
4 PPARG 0 1 0 0 1
5 RELA 1 2 0 0 1
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | |
| Solution 4 | Deepansh Arora |
