'Merging duplicates in Python (selecting fields)
I would like to combine duplicates in a CSV file that I created using pandas. I identify 2 entries as duplicates if year is the same and title in lowercase with punctuation and whitespace removed is also the same.
See example of a duplicate entry:
| Title | year | cited_by_count | pdf_url |
|---|---|---|---|
| "Article title" | 2022 | 34 | "Unknown" |
| "Article title" | 2022 | "Unknown" | www.thisisalink.com/articleid.pdf |
Since article title and year are the same, I don't care whose I keep, but I was wondering if there is a way to set conditions on whose field I keep in the case of "cited_by_count" and "pdf_url," since I would want to keep the first entry's cited_by_count but the second entry's pdf_url data value.
Solution 1:[1]
I am sure, someone will have a oneliner for this, but as a beginner, I would do like the below:
df1 = df.loc[df.cited_by_count != "Unknown"].drop(['pdf_url'], axis=1) #slice by value and drop unwanted column
df2 = df.loc[df.pdf_url != "Unknown"].drop(['cited_by_count'], axis=1)
cols = ['Title', 'year']
df3 = pd.merge(df1,df2, on=cols, how='outer') #merge slices back
out:
Title year cited_by_count pdf_url
0 Article title 2022 34 www.thisisalink.com/articleid.pdf
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | NoobVB |
