Dealing with duplicate rows that have some differing values
How do we deal with duplicate rows in a dataframe when some of their values differ? For example, the dataframe in question contains near-identical rows, but the last four columns (Male, Female, Unknown and Total, which give the number of owners by gender) hold different values. Do we sum those rows, take the mean/median, or can we just delete the duplicates?
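One option the question raises is to aggregate the duplicated rows rather than drop them. A minimal sketch of that approach with pandas groupby; the identifying columns "breed" and "color" and the sample values are hypothetical stand-ins, not taken from the actual dataset:
>>> import pandas as pd
>>> owners = pd.DataFrame({
...     "breed": ["beagle", "beagle", "poodle"],
...     "color": ["brown", "brown", "white"],
...     "Male": [3, 2, 1],
...     "Female": [4, 1, 2],
...     "Unknown": [0, 1, 0],
...     "Total": [7, 4, 3],
... })
>>> # Sum the gender counts for rows that are duplicates on the identifying columns;
>>> # swap .sum() for .mean() or .median() if averaging is more appropriate.
>>> owners.groupby(["breed", "color"], as_index=False)[["Male", "Female", "Unknown", "Total"]].sum()
    breed  color  Male  Female  Unknown  Total
0  beagle  brown     5       5        1     11
1  poodle  white     1       2        0      3
Whether summing, averaging, or dropping is correct depends on what the duplicates represent, which is the judgment call the question is really asking about.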
Solution 1:[1]
If you know the columns that you want to consider, instead of using .drop_duplicates on the whole dataframe, just use it to select unique rows in that subset of columns:
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "unique_values": [1, 2, 3, 4, 5],
...     "column1": [1, 2, 3, 1, 2],
...     "column2": [2, 2, 2, 2, 2],
... })
>>> df
   unique_values  column1  column2
0              1        1        2
1              2        2        2
2              3        3        2
3              4        1        2
4              5        2        2
In this example, we only want to identify duplicates based on the "column1" and "column2" columns. We can call .drop_duplicates on those columns only and then use the resulting index to select the matching rows from the original dataframe:
>>> unique_rows = df[["column1", "column2"]].drop_duplicates()
>>> output_df = df.loc[unique_rows.index]
>>> output_df
   unique_values  column1  column2
0              1        1        2
1              2        2        2
2              3        3        2
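As a side note, the same rows can be selected in a single step, because .drop_duplicates also accepts a subset parameter and keeps the first occurrence of each combination by default:
>>> # Equivalent one-liner: consider only column1/column2 when detecting duplicates
>>> df.drop_duplicates(subset=["column1", "column2"])
   unique_values  column1  column2
0              1        1        2
1              2        2        2
2              3        3        2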
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | aaossa |
