How to group rows on multiple columns with NA values and discrepancies?
I'm looking for a way to group the rows of a data frame on multiple columns that contain missing values. I want to put into the same group every pair of rows that share a value in any of the inspected columns, ignoring missing values and discrepancies in the other columns. The result should not depend on the order in which the missing values appear.
I managed to do this by iterating, but I would like a more efficient, vectorized way of doing it. I work with R, but I would also like to do it in Python.
For example, given the data frame
df = data.frame("ID1"=c(NA,NA,"A","A","A","B","C","B","C"), "ID2"=c("D","E",NA,"D","E","F","F",NA,NA))
I want to obtain the following grouping vector:
c(1,1,1,1,1,2,2,2,2)
where 1 and 2 can be any numbers; they only need to be shared between rows that have a common value in any column.
I hope it's understandable.
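Since I also want this in Python: the task is effectively finding connected components among rows that share a value in any column. Here is a minimal union-find sketch (the helper name `group_rows` is mine, not from any library):

```python
import numpy as np
import pandas as pd

def group_rows(df, cols):
    """Rows sharing a non-missing value in any of `cols` get the same
    group number; groups are numbered 1, 2, ... by first appearance."""
    parent = list(range(len(df)))

    def find(i):                      # find the root, with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):                  # merge the two components
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    for col in cols:
        first = {}                    # value -> first row where it was seen
        for i, v in enumerate(df[col]):
            if pd.isna(v):
                continue              # missing values link nothing
            if v in first:
                union(first[v], i)
            else:
                first[v] = i

    labels = {}
    return [labels.setdefault(find(i), len(labels) + 1)
            for i in range(len(df))]

df = pd.DataFrame({
    "ID1": [np.nan, np.nan, "A", "A", "A", "B", "C", "B", "C"],
    "ID2": ["D", "E", np.nan, "D", "E", "F", "F", np.nan, np.nan],
})
print(group_rows(df, ["ID1", "ID2"]))  # [1, 1, 1, 1, 1, 2, 2, 2, 2]
```

This is still iterative per cell, but it is a single pass over the data instead of a double loop.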
The easiest way I found was a double loop:
df$GrpF = 1:dim(df)[1]
for (i in 1:dim(df)[1]) {
  for (ID in c("ID1", "ID2")) {
    if (!is.na(df[i, ID])) {
      df$GrpF[df[ID] == df[i, ID]] = min(df$GrpF[df[ID] == df[i, ID]], na.rm = TRUE)
    }
  }
}
Here df$GrpF is my final grouping vector. It works well, and I don't get any duplicates when I summarise the information:
library(plyr)  # load plyr before dplyr so that dplyr's summarise is not masked
library(dplyr)
dfG = df %>%
  group_by(GrpF) %>%
  summarise_all(function(x) {
    x1 = unique(x)
    paste0(x1[!is.na(x1) & x1 != ""], collapse = "/")
  })
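For the Python side, assuming the grouping vector GrpF has already been computed, this summarise step could be approximated with pandas (`collapse` is a helper name of my own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID1": [np.nan, np.nan, "A", "A", "A", "B", "C", "B", "C"],
    "ID2": ["D", "E", np.nan, "D", "E", "F", "F", np.nan, np.nan],
    "GrpF": [1, 1, 1, 1, 1, 2, 2, 2, 2],
})

def collapse(x):
    # unique, non-missing, non-empty values joined with "/"
    vals = [v for v in pd.unique(x) if pd.notna(v) and v != ""]
    return "/".join(map(str, vals))

# apply collapse to every remaining column within each group
dfG = df.groupby("GrpF").agg(collapse)
print(dfG)
```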
But on my real data (60,000 rows, 4 columns) it takes a long time (about 5 minutes).
I then tried a single iteration over the columns, using the dplyr and plyr libraries:
grpData = function(df, colGrp, colData, colReplBy = NA) {
  a = df %>%
    group_by_at(colGrp) %>%
    summarise_at(colData, function(x) { sort(x, na.last = TRUE)[1] }) %>%
    filter_at(colGrp, all_vars(!is.na(.)))
  b = plyr::mapvalues(df[[colGrp]], from = a[[colGrp]], to = a[[colData]])
  if (is.na(colReplBy)) {
    b[which(is.na(b))] = NA
  } else if (colReplBy %in% colnames(df)) {
    b[which(is.na(b))] = df[[colReplBy]][which(is.na(b))]  # keep the old value for missing values
  } else {
    stop("Column to use as replacement for missing values is not present in the data frame")
  }
  return(b)
}
df$GrpF = 1:dim(df)[1]
for (ID in c("ID1", "ID2")) {
  # Give rows with the same old group the same ID
  df$IDN = grpData(df, "GrpF", ID)
  # Give rows with the same new ID the same old group
  df$GrpN = grpData(df, "IDN", "GrpF")
  # Give rows with the same ID the same new group
  df$GrpN = grpData(df, ID, "GrpN")
  # Give rows with the same old group the same new group
  df$GrpF = grpData(df, "GrpF", "GrpN", colReplBy = "GrpF")
}
This does work (about 30 seconds on the real data), but I would still like a more efficient way of doing it.
Do you have any ideas?
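One vectorized direction I can imagine (a sketch, not benchmarked at this scale): treat rows and values as the two sides of a bipartite graph and let scipy label the connected components. The helper name `group_rows_sparse` is mine, and the code assumes the data frame has a default RangeIndex:

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def group_rows_sparse(df, cols):
    # long format: one (row, value) pair per non-missing cell
    long = df[cols].stack().dropna()
    rows = long.index.get_level_values(0).to_numpy()
    values, codes = np.unique(long.to_numpy(), return_inverse=True)
    n, m = len(df), len(values)
    # bipartite adjacency: nodes 0..n-1 are rows, n..n+m-1 are values
    ones = np.ones(len(rows), dtype=bool)
    adj = coo_matrix((ones, (rows, n + codes)), shape=(n + m, n + m))
    _, labels = connected_components(adj, directed=False)
    # keep only the row nodes and renumber components from 1 upward
    return (pd.factorize(labels[:n])[0] + 1).tolist()

df = pd.DataFrame({
    "ID1": [np.nan, np.nan, "A", "A", "A", "B", "C", "B", "C"],
    "ID2": ["D", "E", np.nan, "D", "E", "F", "F", np.nan, np.nan],
})
print(group_rows_sparse(df, ["ID1", "ID2"]))  # [1, 1, 1, 1, 1, 2, 2, 2, 2]
```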
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow