How to merge columns and delete duplicates but keep unique values?
I want to merge rows based on the same ID and consolidate them into just one row per ID. Can anyone help me merge the columns for both duplicate and non-duplicate values?
Given:
ID Name Degree AM_Class PM_Class Online_Class
01 Kathy Biology Bio101 NaN NaN
01 Kathy Biology NaN Chem101 NaN
02 James Chemistry NaN Chem101 NaN
03 Henry Business Bus100 NaN NaN
03 Henry Business NaN Math100 NaN
03 Henry Business NaN NaN Acct100
Expected Output:
ID Name Degree AM_Class PM_Class Online_Class
01 Kathy Biology Bio101 Chem101 NaN
02 James Chemistry NaN Chem101 NaN
03 Henry Business Bus100 Math100 Acct100
I tried to use:
df = df.groupby(['Name','Degree','ID'])['AM_Class', 'PM_Class', 'Online_Class'].apply(', '.join).reset_index()
but it seems to give an error.
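For context, that attempt fails for two likely reasons: str.join raises a TypeError on the float NaN values, and selecting multiple columns with a bare tuple after groupby is deprecated in recent pandas (a list is required). A minimal sketch of a repaired version of that exact attempt, not one of the answers below, could look like:
# Assumption: NaN must be dropped before joining, since str.join
# cannot concatenate floats; columns are selected with a list.
df = (df.groupby(['ID', 'Name', 'Degree'], as_index=False)
        [['AM_Class', 'PM_Class', 'Online_Class']]
        .agg(lambda s: ', '.join(s.dropna())))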
Solution 1:[1]
If you need the first non-missing value per group, use GroupBy.first:
df = df.groupby(['ID','Name','Degree'], as_index=False).first()
print (df)
ID Name Degree AM_Class PM_Class Online_Class
0 01 Kathy Biology Bio101 Chem101 None
1 02 James Chemistry None Chem101 None
2 03 Henry Business Bus100 Math100 Acct100
Or, if you need all unique non-missing values per group, use a custom lambda function in GroupBy.agg to process each column separately: drop missing values with Series.dropna, remove duplicates with dict.fromkeys, and join the remaining values with ', ':
import numpy as np

f = lambda x: ', '.join(dict.fromkeys(x.dropna()))
df = df.groupby(['ID','Name','Degree'], as_index=False).agg(f).replace('', np.nan)
The difference becomes visible with slightly changed data:
print (df)
ID Name Degree AM_Class PM_Class Online_Class
0 01 Kathy Biology Bio101 NaN NaN
1 01 Kathy Biology NaN Chem101 NaN
2 02 James Chemistry NaN Chem101 NaN
3 03 Henry Business Bus100 NaN NaN
4 03 Henry Business NaN Math100 Acct100
5 03 Henry Business NaN Math200 Acct100
df1 = df.groupby(['ID','Name','Degree'], as_index=False).first()
print (df1)
ID Name Degree AM_Class PM_Class Online_Class
0 01 Kathy Biology Bio101 Chem101 None
1 02 James Chemistry None Chem101 None
2 03 Henry Business Bus100 Math100 Acct100
f = lambda x: ', '.join(dict.fromkeys(x.dropna()))
df2 = df.groupby(['ID','Name','Degree'], as_index=False).agg(f).replace('', np.nan)
print (df2)
ID Name Degree AM_Class PM_Class Online_Class
0 01 Kathy Biology Bio101 Chem101 NaN
1 02 James Chemistry NaN Chem101 NaN
2 03 Henry Business Bus100 Math100, Math200 Acct100
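A side note on the dedup step: dict.fromkeys is used instead of set because it removes duplicates while preserving insertion order (guaranteed for dicts since Python 3.7), so the joined string keeps the original row order. A tiny illustration:
# dict.fromkeys keeps first-seen order; set gives no order guarantee
list(dict.fromkeys(['Math100', 'Math200', 'Math100']))  # ['Math100', 'Math200']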
Solution 2:[2]
Here is your data:
df = pd.DataFrame({'ID': ['01', '01', '02', '03', '03', '03'],
                   'Degree': ['Biology', 'Biology', 'Chemistry', 'Business', 'Business', 'Business'],
                   'Name': ['Kathy', 'Kathy', 'James', 'Henry', 'Henry', 'Henry'],
                   'AM_Class': ['Bio101', np.nan, np.nan, 'Bus100', np.nan, np.nan],
                   'PM_Class': [np.nan, 'Chem101', 'Chem101', np.nan, 'Math100', np.nan],
                   'Online_Class': [np.nan, np.nan, np.nan, np.nan, np.nan, 'Acct100']})
You can separate the data into per-class data frames, remove the NaN values from each, then rejoin them.
The reduce() function applies the merge across the whole list, so you don't have to write out each pairwise merge by hand.
from functools import reduce
# Separate the data frames
df_student = df[['ID', 'Name', 'Degree']]
df_AM = df[['ID', 'Name', 'AM_Class']]
df_PM = df[['ID', 'Name', 'PM_Class']]
df_OL = df[['ID', 'Name', 'Online_Class']]
# List of data frames
dfs = [df_student, df_AM, df_PM, df_OL]
# Remove all NaNs; dropna on copies rather than inplace on the
# slices, which would shadow df and trigger SettingWithCopyWarning
dfs = [d.dropna() for d in dfs]
# Merge the data frames, now without the NaNs
df_merged = reduce(lambda left, right: pd.merge(left, right, how='left', on=['ID', 'Name']), dfs)
ID Name Degree AM_Class PM_Class Online_Class
0 01 Kathy Biology Bio101 Chem101 NaN
1 01 Kathy Biology Bio101 Chem101 NaN
2 02 James Chemistry NaN Chem101 NaN
3 03 Henry Business Bus100 Math100 Acct100
4 03 Henry Business Bus100 Math100 Acct100
5 03 Henry Business Bus100 Math100 Acct100
Then you just need to remove the duplicates.
df_merged = df_merged.drop_duplicates().reset_index(drop=True)
This is the result:
ID Name Degree AM_Class PM_Class Online_Class
0 01 Kathy Biology Bio101 Chem101 NaN
1 02 James Chemistry NaN Chem101 NaN
2 03 Henry Business Bus100 Math100 Acct100
Solution 3:[3]
You can forward-fill the rows within each ID group first, then drop duplicates while keeping the last occurrence per student:
df.groupby(['ID']).ffill().drop_duplicates(subset='Name', keep='last')
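For reference, a runnable end-to-end sketch of this approach on the sample data; one assumption here is that groupby().ffill() excludes the grouping column in recent pandas, so ID is restored explicitly before deduplicating:
import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': ['01', '01', '02', '03', '03', '03'],
                   'Name': ['Kathy', 'Kathy', 'James', 'Henry', 'Henry', 'Henry'],
                   'Degree': ['Biology', 'Biology', 'Chemistry', 'Business', 'Business', 'Business'],
                   'AM_Class': ['Bio101', np.nan, np.nan, 'Bus100', np.nan, np.nan],
                   'PM_Class': [np.nan, 'Chem101', 'Chem101', np.nan, 'Math100', np.nan],
                   'Online_Class': [np.nan, np.nan, np.nan, np.nan, np.nan, 'Acct100']})

# Forward-fill within each ID group: the last row of each group then
# carries every value seen so far for that student.
filled = df.groupby('ID').ffill()
filled['ID'] = df['ID']  # restore the group key dropped by ffill

# Keep only that last, fully filled row per student.
result = filled.drop_duplicates(subset='Name', keep='last')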
Solution 4:[4]
See my alternative solution below.
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID': ['01', '01', '02', '03', '03', '03'],
                   'Degree': ['Biology', 'Biology', 'Chemistry', 'Business', 'Business', 'Business'],
                   'Name': ['Kathy', 'Kathy', 'James', 'Henry', 'Henry', 'Henry'],
                   'AM_Class': ['Bio101', np.nan, np.nan, 'Bus100', np.nan, np.nan],
                   'PM_Class': [np.nan, 'Chem101', 'Chem101', np.nan, 'Math100', np.nan],
                   'Online_Class': [np.nan, np.nan, np.nan, np.nan, np.nan, 'Acct100']})
# Merge duplicates: fill NaN with '', join the non-empty values per
# group; as_index=False keeps the group keys as regular columns
df = df.fillna('').groupby(['Name','Degree','ID'], as_index=False).agg(lambda x: ','.join(filter(None, x)))
Output:
Name Degree ID AM_Class PM_Class Online_Class
0 Henry Business 03 Bus100 Math100 Acct100
1 James Chemistry 02 Chem101
2 Kathy Biology 01 Bio101 Chem101
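A quick note on the lambda: filter(None, x) drops the empty strings introduced by fillna(''), so only real class codes end up in the joined result:
','.join(filter(None, ['', 'Math100', '']))  # 'Math100'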
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | |
| Solution 4 | |
