Dealing with duplicate rows that have some different values

How do we deal with duplicate rows in a dataframe when some of their values are different? For example, in the dataframe below we have similar rows, but the values in the last four columns (male, female, unknown, and total, which represent the number of owners by gender) differ. Do we sum the rows, take the mean/median, or can we just delete the duplicates? [dataset]
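
(For reference, "summing the rows" would usually mean grouping on the columns that identify a record and aggregating the gender counts. A minimal sketch, assuming a hypothetical identifying column "breed" and count columns "male", "female", "unknown", "total":)

>>> import pandas as pd
>>> df = pd.DataFrame({
...     "breed": ["beagle", "beagle", "poodle"],
...     "male": [3, 2, 5],
...     "female": [4, 1, 6],
...     "unknown": [0, 1, 0],
...     "total": [7, 4, 11],
... })
>>> # Sum the count columns for rows that share the same identifying column
>>> df.groupby("breed", as_index=False)[["male", "female", "unknown", "total"]].sum()
    breed  male  female  unknown  total
0  beagle     5       5        1     11
1  poodle     5       6        0     11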



Solution 1:[1]

If you know which columns you want to consider, instead of applying .drop_duplicates to the whole dataframe, apply it only to that subset of columns to select the unique rows:

>>> import pandas as pd
>>> df = pd.DataFrame({
...     "unique_values": [1, 2, 3, 4, 5],
...     "column1": [1, 2, 3, 1, 2],
...     "column2": [2, 2, 2, 2, 2],
... })
>>> df
   unique_values  column1  column2
0              1        1        2
1              2        2        2
2              3        3        2
3              4        1        2
4              5        2        2

In this example, we only want to identify duplicate rows by looking at the "column1" and "column2" columns. We can call .drop_duplicates on those columns only and then use the resulting index to select the matching rows from the original dataframe:

>>> unique_rows = df[["column1", "column2"]].drop_duplicates()
>>> output_df = df.loc[unique_rows.index]
>>> output_df
   unique_values  column1  column2
0              1        1        2
1              2        2        2
2              3        3        2
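
A shorter way to get the same rows in this example is the subset parameter of .drop_duplicates, which considers only the given columns when deciding which rows are duplicates and keeps the first occurrence of each ("column1", "column2") pair:

>>> # Equivalent one-liner: drop duplicates based on a subset of columns
>>> df.drop_duplicates(subset=["column1", "column2"])
   unique_values  column1  column2
0              1        1        2
1              2        2        2
2              3        3        2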

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 aaossa