Pandas: remove duplicated rows using mode
Given a pandas.DataFrame, how can I remove rows that are duplicated across several columns, keeping only the rows whose value in another column is the mode for that group?
import pandas as pd

df = pd.DataFrame(
    data={
        "col_1": [0, 0, 0, 0, 1, 1, 1, 1],
        "col_2": [1, 1, 1, 1, 2, 2, 2, 2],
        "col_3": [5, 5, 0, 1, 8, 8, 0, 1],
        "another_column": [0, 0, 0, 0, 0, 0, 0, 0],
    }
)
# the following line shows the correct answer but doesn't return the original
# dataframe with only the two unique rows
print(df.groupby(by=["col_1", "col_2"])["col_3"].agg(lambda x: x.mode()[0]))
Solution 1:[1]
Use GroupBy.transform to broadcast each group's mode back onto the original rows, then keep the rows where col_3 equals that mode using boolean indexing:
s = df.groupby(by=["col_1", "col_2"])["col_3"].transform(lambda x: x.mode()[0])
df1 = df[df['col_3'].eq(s)]
print(df1)
   col_1  col_2  col_3  another_column
0      0      1      5               0
1      0      1      5               0
4      1      2      8               0
5      1      2      8               0
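To see why this works, it can help to look at the intermediate values. The sketch below rebuilds the question's DataFrame and inspects the Series returned by transform; the names group_mode and mask are only illustrative and are not part of the original answer.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame(
    data={
        "col_1": [0, 0, 0, 0, 1, 1, 1, 1],
        "col_2": [1, 1, 1, 1, 2, 2, 2, 2],
        "col_3": [5, 5, 0, 1, 8, 8, 0, 1],
        "another_column": [0, 0, 0, 0, 0, 0, 0, 0],
    }
)

# transform repeats each group's mode on every row of that group,
# so the result has the same index as df.
group_mode = df.groupby(by=["col_1", "col_2"])["col_3"].transform(
    lambda x: x.mode()[0]
)

# True where the row's col_3 equals the mode of its group.
mask = df["col_3"].eq(group_mode)

print(group_mode.tolist())  # [5, 5, 5, 5, 8, 8, 8, 8]
print(mask.tolist())        # [True, True, False, False, True, True, False, False]
```

Unlike agg, which returns one row per group, transform returns a Series aligned with the original index, so it can be compared with col_3 row by row. Note that Series.mode returns its values sorted, so x.mode()[0] picks the smallest value when a group has a tie.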
If you need only the first row per group:
s = df.groupby(by=["col_1", "col_2"])["col_3"].transform(lambda x: x.mode()[0])
df1 = df[df['col_3'].eq(s)].drop_duplicates(["col_1", "col_2"])
print(df1)
   col_1  col_2  col_3  another_column
0      0      1      5               0
4      1      2      8               0
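If this is needed in more than one place, both variants could be wrapped in a small helper. This is only a sketch: keep_mode_rows and its parameters are hypothetical names, not a pandas API, and it assumes every group has at least one non-NaN value (otherwise x.mode() is empty and the [0] lookup fails).

```python
import pandas as pd


def keep_mode_rows(df, keys, value, first_only=False):
    """Keep rows whose `value` equals the mode of `value` within each `keys` group.

    Hypothetical helper for illustration; ties resolve to the smallest mode.
    """
    # Broadcast each group's mode back onto the original rows.
    group_mode = df.groupby(by=keys)[value].transform(lambda x: x.mode()[0])
    out = df[df[value].eq(group_mode)]
    if first_only:
        # Keep a single row per group (the first one in original order).
        out = out.drop_duplicates(keys)
    return out


# Same data as in the question.
df = pd.DataFrame(
    data={
        "col_1": [0, 0, 0, 0, 1, 1, 1, 1],
        "col_2": [1, 1, 1, 1, 2, 2, 2, 2],
        "col_3": [5, 5, 0, 1, 8, 8, 0, 1],
        "another_column": [0, 0, 0, 0, 0, 0, 0, 0],
    }
)

result = keep_mode_rows(df, ["col_1", "col_2"], "col_3", first_only=True)
print(result)
```

With first_only=False the helper returns every row matching the group mode, as in the first snippet of this answer.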
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
