How to group multiple columns with NA values and discrepancies?

I'm looking for a way to group a data frame across multiple columns containing missing values. I want to group together every row that shares a common value in any of the inspected columns, ignoring missing values and discrepancies in the data. The script should be independent of the order in which the missing values appear.

I succeeded in doing so by iteration, but I would like a more efficient, vectorized way of doing it. I used R, but I would also like to do it in Python.

For example, given the data frame

df = data.frame("ID1"=c(NA,NA,"A","A","A","B","C","B","C"), "ID2"=c("D","E",NA,"D","E","F","F",NA,NA))

I want to obtain a final grouping vector such as

c(1,1,1,1,1,2,2,2,2)

where 1 and 2 could be any numbers; they only need to be shared between rows that have a common value in any column.

I hope that's understandable.
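To make the goal concrete for the Python side of the question, here is a minimal sketch of one way to view the problem: rows are nodes, two rows are linked whenever they share a non-missing value in some inspected column, and the groups are the connected components. The `group_rows` helper and the use of `None` for NA are my own illustration, not an established API.

```python
def group_rows(columns):
    """columns: list of lists, one per ID column; None marks a missing value.
    Returns a group label (1, 2, ...) per row via union-find."""
    n = len(columns[0])
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for col in columns:
        first_row = {}  # value -> first row where it appeared
        for row, val in enumerate(col):
            if val is None:
                continue  # missing values never link rows
            if val in first_row:
                union(row, first_row[val])
            else:
                first_row[val] = row

    # relabel roots as consecutive integers in order of first appearance
    labels, out = {}, []
    for i in range(n):
        r = find(i)
        if r not in labels:
            labels[r] = len(labels) + 1
        out.append(labels[r])
    return out

id1 = [None, None, "A", "A", "A", "B", "C", "B", "C"]
id2 = ["D", "E", None, "D", "E", "F", "F", None, None]
print(group_rows([id1, id2]))  # [1, 1, 1, 1, 1, 2, 2, 2, 2]
```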

The easiest way I found was a double iteration:

df$GrpF = 1:nrow(df)
for (i in 1:nrow(df)){
    for (ID in c("ID1","ID2")){
        if (!is.na(df[i, ID])){
            idx = which(df[[ID]] == df[i, ID])  # which() drops the NA comparisons
            df$GrpF[idx] = min(df$GrpF[idx])
        }
    }
}

where df$GrpF is my final grouping vector. It works well, and I don't get any duplicates when I summarise the information:

library(dplyr)
library(plyr)  # for plyr::mapvalues() used below
dfG = df %>% group_by(GrpF) %>% summarise_all(
    function(x){
        x1 = unique(x)
        paste0(x1[!is.na(x1) & x1 != ""], collapse = "/")
    }
)
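For the Python side of the question, the same summarise step can be sketched with pandas. The `GrpF` values below are the ones the loop above produces for the example data, and the `collapse` helper name is my own:

```python
import pandas as pd

df = pd.DataFrame({
    "GrpF": [1, 1, 1, 1, 1, 6, 6, 6, 6],  # group vector from the loop above
    "ID1": [None, None, "A", "A", "A", "B", "C", "B", "C"],
    "ID2": ["D", "E", None, "D", "E", "F", "F", None, None],
})

def collapse(x):
    # join the distinct non-missing values with "/"
    vals = x.dropna().unique()
    return "/".join(v for v in vals if v != "")

dfG = df.groupby("GrpF").agg(collapse)
# group 1 -> ID1 "A",   ID2 "D/E"
# group 6 -> ID1 "B/C", ID2 "F"
print(dfG)
```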

But with my real data (60,000 rows and 4 columns), it takes a long time (about 5 minutes).

I also tried a single iteration over the columns, using the dplyr and plyr libraries:

grpData = function(df, colGrp, colData, colReplBy = NA){
    # For each group in colGrp, take the smallest non-NA value of colData,
    # then map it back onto every row of that group
    a = df %>% group_by_at(colGrp) %>%
        summarise_at(colData, function(x){ sort(x, na.last = TRUE)[1] }) %>%
        filter_at(colGrp, all_vars(!is.na(.)))
    b = plyr::mapvalues(df[[colGrp]], from = a[[colGrp]], to = a[[colData]])
    if (is.na(colReplBy)) {
        b[which(is.na(b))] = NA  # leave missing values as NA
    } else if (colReplBy %in% colnames(df)) {
        b[which(is.na(b))] = df[[colReplBy]][which(is.na(b))]  # fall back to the old value
    } else {
        stop("Column to use as replacement for missing values is not present in the data frame")
    }
    return(b)
}

df$GrpF = 1:dim(df)[1]
for (ID in c("ID1","ID2")){
    #Set all same old group same ID
    df$IDN = grpData(df,"GrpF",ID)

    #Set all same new ID the same old group
    df$GrpN = grpData(df,"IDN","GrpF")
    
    #Set all same ID the same new group
    df$GrpN = grpData(df,ID,"GrpN")
    
    #Set all same old group the same new group
    df$GrpF = grpData(df,"GrpF","GrpN", colReplBy = "GrpF")

}

This does work (about 30 seconds on the real data), but I would like a more efficient way of doing it.
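Since the task is really a connected-components computation, a vectorized Python sketch may scale better, assuming pandas and SciPy are acceptable dependencies. The idea (my own illustration, not an established recipe): build a bipartite graph whose nodes are the rows plus the distinct values of each column, connect each row to its non-missing values, and label the components.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

df = pd.DataFrame({
    "ID1": [None, None, "A", "A", "A", "B", "C", "B", "C"],
    "ID2": ["D", "E", None, "D", "E", "F", "F", None, None],
})

n = len(df)
row_idx, val_idx = [], []
offset = n  # value nodes are numbered after the n row nodes
for col in ["ID1", "ID2"]:
    codes, uniques = pd.factorize(df[col])  # code -1 marks a missing value
    mask = codes >= 0
    row_idx.append(np.flatnonzero(mask))
    val_idx.append(codes[mask] + offset)
    offset += len(uniques)

rows = np.concatenate(row_idx)
vals = np.concatenate(val_idx)
graph = coo_matrix((np.ones(len(rows), dtype=bool), (rows, vals)),
                   shape=(offset, offset))
_, labels = connected_components(graph, directed=False)

# keep only the row nodes and renumber components from 1 in order of appearance
grp = pd.factorize(labels[:n])[0] + 1
print(grp.tolist())  # [1, 1, 1, 1, 1, 2, 2, 2, 2]
```

The graph construction is fully vectorized per column, and `connected_components` runs in near-linear time, so 60,000 rows should be well within reach.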

Do you have any ideas?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
