'How to merge columns and delete duplicates but keep unique values?

I want to merge columns based on same IDs and want to make sure to consolidate the rows into just one row (per ID). Can anyone help me to merge the columns for duplicates and non-duplicates?

Given:

ID      Name     Degree       AM_Class     PM_Class     Online_Class
01      Kathy    Biology      Bio101       NaN          NaN
01      Kathy    Biology      NaN          Chem101      NaN
02      James    Chemistry    NaN          Chem101      NaN
03      Henry    Business     Bus100       NaN          NaN
03      Henry    Business     NaN          Math100      NaN
03      Henry    Business     NaN          NaN          Acct100

Expected Output:

ID      Name     Degree       AM_Class     PM_Class     Online_Class
01      Kathy    Biology      Bio101       Chem101      NaN
02      James    Chemistry    NaN          Chem101      NaN
03      Henry    Business     Bus100       Math100      Acct100

I tried to use:

df = df.groupby(['Name','Degree','ID'])['AM_Class', 'PM_Class', 'Online_Class'].apply(', '.join).reset_index()

but seems like it's giving an error..



Solution 1:[1]

If need first non missing values per groups use GroupBy.first:

df = df.groupby(['ID','Name','Degree'], as_index=False).first()
print (df)
   ID   Name     Degree AM_Class PM_Class Online_Class
0  01  Kathy    Biology   Bio101  Chem101         None
1  02  James  Chemistry     None  Chem101         None
2  03  Henry   Business   Bus100  Math100      Acct100

Or if need all unique values without missing values per groups use custom lambda function in GroupBy.agg for processing each column separately by Series.dropna, removed duplicated by dict.fromkeys and last join values by ,:

f = lambda x: ', '.join(dict.fromkeys(x.dropna()))
df = df.groupby(['ID','Name','Degree'], as_index=False).agg(f).replace('', np.nan)

Difference is possible see in changed data:

print (df)
   ID   Name     Degree AM_Class PM_Class Online_Class
0  01  Kathy    Biology   Bio101      NaN          NaN
1  01  Kathy    Biology      NaN  Chem101          NaN
2  02  James  Chemistry      NaN  Chem101          NaN
3  03  Henry   Business   Bus100      NaN          NaN
4  03  Henry   Business      NaN  Math100      Acct100
5  03  Henry   Business      NaN  Math200      Acct100

df1 = df.groupby(['ID','Name','Degree'], as_index=False).first()
print (df1)
   ID   Name     Degree AM_Class PM_Class Online_Class
0  01  Kathy    Biology   Bio101  Chem101         None
1  02  James  Chemistry     None  Chem101         None
2  03  Henry   Business   Bus100  Math100      Acct100


f = lambda x: ', '.join(dict.fromkeys(x.dropna()))
df2 = df.groupby(['ID','Name','Degree'], as_index=False).agg(f).replace('', np.nan)
print (df2)
   ID   Name     Degree AM_Class          PM_Class Online_Class
0  01  Kathy    Biology   Bio101           Chem101          NaN
1  02  James  Chemistry      NaN           Chem101          NaN
2  03  Henry   Business   Bus100  Math100, Math200      Acct100

Solution 2:[2]

Here is your data:

df = pd.DataFrame({'ID': ['01', '01', '02', '03', '03', '03'],
                   'Degree': ['Biology', 'Biology', 'Chemistry', 'Business', 'Business', 'Business'],
                   'Name': ['Kathy', 'Kathy', 'James', 'Henry', 'Henry', 'Henry'],
                   'AM_Class': ['Bio101', np.nan, np.nan, 'Bus100', np.nan, np.nan],
                   'PM_Class': [np.nan, 'Chem101', 'Chem101', np.nan, 'Math100', np.nan],
                   'Online_Class': [np.nan, np.nan, np.nan, np.nan, np.nan, 'Acct100']})

You can separate the data frames, remove the NaN values, then rejoin them.

The reduce() function allows the merge to be performed iteratively, without having to merge the data frames one by one.

from functools import reduce

# Separate the data frames
df_student = df[['ID', 'Name', 'Degree']]
df_AM = df[['ID', 'Name', 'AM_Class']]
df_PM = df[['ID', 'Name', 'PM_Class']]
df_OL = df[['ID', 'Name', 'Online_Class']]

# List of data frames
dfs = [df_student, df_AM, df_PM, df_OL]

# Remove all NaNs
for df in dfs:
    df.dropna(inplace=True)

# Merge dataframes without the NaNs
df_merged = reduce(lambda left, right: pd.merge(left, right, how='left', on=['ID', 'Name']), dfs)


    ID  Name    Degree      AM_Class    PM_Class    Online_Class
0   01  Kathy   Biology     Bio101      Chem101     NaN
1   01  Kathy   Biology     Bio101      Chem101     NaN
2   02  James   Chemistry   NaN         Chem101     NaN
3   03  Henry   Business    Bus100      Math100     Acct100
4   03  Henry   Business    Bus100      Math100     Acct100
5   03  Henry   Business    Bus100      Math100     Acct100

Then you just need to remove the duplicates.

df_merged.drop_duplicates(inplace=True).reset_index()

This is the result:

     ID Name    Degree      AM_Class    PM_Class    Online_Class
0    01 Kathy   Biology     Bio101      Chem101     NaN
1    02 James   Chemistry   NaN         Chem101     NaN
2    03 Henry   Business    Bus100      Math100     Acct100

Solution 3:[3]

You may ffill rows first and then drop duplicates while keeping the last occurrence of duplicates,

df.groupby(['ID']).ffill().drop_duplicates(subset='Name', keep='last')

Solution 4:[4]

See my alternative solution below.

import pandas as pd, numpy as np
df = pd.DataFrame({'ID': ['01', '01', '02', '03', '03', '03'],
                   'Degree': ['Biology', 'Biology', 'Chemistry', 'Business', 'Business', 'Business'],
                   'Name': ['Kathy', 'Kathy', 'James', 'Henry', 'Henry', 'Henry'],
                   'AM_Class': ['Bio101', np.nan, np.nan, 'Bus100', np.nan, np.nan],
                   'PM_Class': [np.nan, 'Chem101', 'Chem101', np.nan, 'Math100', np.nan],
                   'Online_Class': [np.nan, np.nan, np.nan, np.nan, np.nan, 'Acct100']})

# merge duplicates, reset index
df = df.fillna('').groupby(['Name','Degree','ID'])[df.columns].agg(lambda x: ','.join(filter(None, x))).reset_index(drop=False)

Output:

    Name    Degree      ID      AM_Class    PM_Class    Online_Class
0   Henry   Business    03      Bus100      Math100     Acct100
1   James   Chemistry   02                  Chem101 
2   Kathy   Biology     01      Bio101      Chem101 

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3 ????????
Solution 4