'Merging duplicates in Python (selecting fields)

I would like to combine duplicates in a CSV file that I created using pandas. I identify 2 entries as duplicates if year is the same and title in lowercase with punctuation and whitespace removed is also the same.

See example of a duplicate entry:

Title year cited_by_count pdf_url
"Article title" 2022 34 "Unknown"
"Article title" 2022 "Unknown" www.thisisalink.com/articleid.pdf

Since article title and year are the same, I don't care whose I keep, but I was wondering if there is a way to set conditions on whose field I keep in the case of "cited_by_count" and "pdf_url," since I would want to keep the first entry's cited_by_count but the second entry's pdf_url data value.



Solution 1:[1]

I am sure, someone will have a oneliner for this, but as a beginner, I would do like the below:

df1 = df.loc[df.cited_by_count != "Unknown"].drop(['pdf_url'], axis=1) #slice by value and drop unwanted column
df2 = df.loc[df.pdf_url != "Unknown"].drop(['cited_by_count'], axis=1)
cols = ['Title', 'year']
df3 = pd.merge(df1,df2, on=cols, how='outer') #merge slices back

out:

           Title  year cited_by_count                            pdf_url
0  Article title  2022             34  www.thisisalink.com/articleid.pdf

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 NoobVB