'Pandas - Compare each row with one another across dataframe and list the amount of duplicate values
I would like to add a column to an existing dataframe that compares every row in the dataframe against each other and list the amount of duplicate values. (I don't want to remove any of the rows, even if they are entirely duplicated with another row)
The duplicates column should show something like this:
Name Name1 Name2 Name3 Name4 Duplicates
Mark Doug Jim Tom Alex 5
Mark Doug Jim Tom Peter 4
Mark Jim Doug Tom Alex 5
Josh Jesse Jim Tom Alex 3
Adam Cam Max Matt James 0
Solution 1:[1]
IIUC, you can convert your dataframe to an array of sets, then use numpy broadcasting to compare each combination (except the diagonal) and get the max intersection:
names = df.agg(set, axis=1)
a = df.agg(set, axis=1).to_numpy()
b = a&a[:,None]
np.fill_diagonal(b, {})
df['Duplicates'] = [max(map(len, x)) for x in b]
output:
Name Name1 Name2 Name3 Name4 Duplicates
0 Mark Doug Jim Tom Alex 5
1 Mark Doug Jim Tom Peter 4
2 Mark Jim Doug Tom Alex 5
3 Josh Jesse Jim Tom Alex 3
4 Adam Cam Max Matt James 0
Solution 2:[2]
something you can use is the DataFrame's handy dandy groupby method. This will allow you to group your data(frame) by a specified subset of attributes/columns, and use the size() or count() method to get the number of rows in each group:
# group pandas.DataFrame df by Names 1-4 and summarize with count
duplicates = df.groupby(['name_1','name_2','name_3','name_4']).size().sort_values(ascending=False).reset_index(name='duplicates')
Note: I use the reset_index() method here to return a DataFrame instead of a Series. I use the sort_values() method to order the dataframe by the size (in this case, the number of rows in each group).
Solution 3:[3]
This is something I came up with, not an optimal solution - however, works!
unique_values_dict={} # dictionary to store the unique values of each column
columns=df.columns # all columns are kept in the columns variable
for c in columns: # for each column, find the distribution and store in a dictionary
unique_values=[ (i,j) for i,j in df[c].value_counts().items()]
unique_values_dict[c]=unique_values
df['duplicates']=0 # add a column called 'duplicates' with 0 as the default value
def count_duplicates(row):
'''
row: Each row of the dataframe
returns: sum of duplicate items in the row, across the df
'''
dups=row['duplicates']
for c in columns:
#print(c,row[c],unique_values_dict[c][1][0])
if(row[c]==unique_values_dict[c][1][0]):
dups+=unique_values_dict[c][1][1]
#print(dups)
return dups
df['duplicates']=df.apply(lambda row:count_duplicates(row),axis=1)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | mozway |
| Solution 2 | matsuninja |
| Solution 3 | pnv |
