Duplicated rows when merging dataframes in Python
I am currently merging two dataframes with an inner join. However, after merging, I see that all the rows are duplicated, even though the columns I merged on contain the same values.
Specifically, I have the following code.
merged_df = pd.merge(df1, df2, on=['email_address'], how='inner')
Here are the two dataframes and the results.
df1
          email_address    name   surname
0  [email protected]    john     smith
1  [email protected]    john     smith
2       [email protected]   elvis   presley
df2
          email_address    street  city
0  [email protected]   street1    NY
1  [email protected]   street1    NY
2       [email protected]   street2    LA
merged_df
          email_address    name   surname    street  city
0  [email protected]    john     smith   street1    NY
1  [email protected]    john     smith   street1    NY
2  [email protected]    john     smith   street1    NY
3  [email protected]    john     smith   street1    NY
4       [email protected]   elvis   presley   street2    LA
5       [email protected]   elvis   presley   street2    LA
My question is, shouldn't it look like this?
This is how I would like my merged_df to look.
          email_address    name   surname    street  city
0  [email protected]    john     smith   street1    NY
1  [email protected]    john     smith   street1    NY
2       [email protected]   elvis   presley   street2    LA
Are there any ways I can achieve this?
Solution 1:[1]
list_2_nodups = list_2.drop_duplicates()
pd.merge(list_1, list_2_nodups, on=['email_address'])
The duplicate rows are expected. Each john smith row in list_1 matches each john smith row in list_2, so the merge returns every pairing of the two. To avoid that, I had to drop the duplicates in one of the lists before merging. I chose list_2.
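To make the row multiplication concrete, here is a minimal sketch modeled on the frames in the question. The addresses are placeholders (an assumption, since the real ones are obscured), and the `validate` argument is an optional extra guard that the answer itself does not use.

```python
import pandas as pd

# Placeholder addresses (assumed); the real ones are obscured in the question.
df1 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "elvis@example.com"],
    "name": ["john", "john", "elvis"],
    "surname": ["smith", "smith", "presley"],
})
df2 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "elvis@example.com"],
    "street": ["street1", "street1", "street2"],
    "city": ["NY", "NY", "LA"],
})

# Many-to-many join: each of the 2 john rows in df1 pairs with each of the
# 2 john rows in df2 (2 x 2 = 4), plus 1 x 1 for elvis.
many_to_many = pd.merge(df1, df2, on="email_address", how="inner")
print(len(many_to_many))

# Drop exact duplicate rows from the right-hand frame so each key is unique there.
df2_nodups = df2.drop_duplicates()

# validate="many_to_one" makes pandas raise a MergeError if the right-hand keys
# are still not unique, instead of silently multiplying rows.
merged_df = pd.merge(
    df1, df2_nodups, on="email_address", how="inner", validate="many_to_one"
)
print(merged_df)
```

With the right-hand keys unique, every row of df1 matches at most one row of df2, so the result keeps the row count of df1.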
Solution 2:[2]
DO NOT drop duplicates BEFORE the merge, but after!
The best solution is to do the merge and then drop the duplicates.
In your case:
merged_df = pd.merge(df1, df2, on=['email_address'], how='inner')
merged_df.drop_duplicates(subset=['email_address'], keep='first', inplace=True, ignore_index=True)
Hope this helps!
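A short sketch of what merge-then-drop produces, again using placeholder addresses (an assumption) in frames modeled on the question. One thing worth checking before adopting it: with subset=['email_address'], pandas keeps a single row per address, so if df1 legitimately contains two rows for the same address, they collapse into one.

```python
import pandas as pd

# Placeholder addresses (assumed); the real ones are obscured in the question.
df1 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "elvis@example.com"],
    "name": ["john", "john", "elvis"],
    "surname": ["smith", "smith", "presley"],
})
df2 = pd.DataFrame({
    "email_address": ["john@example.com", "john@example.com", "elvis@example.com"],
    "street": ["street1", "street1", "street2"],
    "city": ["NY", "NY", "LA"],
})

# Merge first; duplicate keys on both sides multiply the matching rows.
merged_df = pd.merge(df1, df2, on="email_address", how="inner")

# Then keep the first row seen for each address and renumber the index.
# Returning a new frame here instead of using inplace=True.
deduped = merged_df.drop_duplicates(
    subset=["email_address"], keep="first", ignore_index=True
)
print(deduped)
```

If the repeated rows in df1 should survive, dropping duplicates from df2 before the merge (as in Solution 1) preserves them, while this approach does not.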
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source | 
|---|---|
| Solution 1 | |
| Solution 2 | Rafael Amaral | 

