When your pandas DataFrame has a column of list type, removing duplicates from each item
My dataframe has a column of lists and looks like this.
   id  source
0   3  [nan, nan, nan]
1   5  [nan, foo, foo, nan, foo]
2   7  [ham, nan, ham, nan]
3   9  [foo, foo]
I need to remove the duplicates from each list, so I am looking for something like the output below.
   id  source
0   3  [nan]
1   5  [nan, foo]
2   7  [ham, nan]
3   9  [foo]
I tried the following code, which didn't work. What do you recommend?
df['source'] = list(set(df['source']))
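A note on why the attempt above fails: iterating over df['source'] yields the lists themselves, and lists are unhashable, so set() raises a TypeError. Even with hashable items, this would deduplicate across the whole column rather than within each row's list. A minimal sketch (the toy values here are made up for illustration):

```python
import pandas as pd

# Toy frame whose "source" column holds Python lists
df = pd.DataFrame({"id": [3, 5], "source": [[1, 1], [2, 2]]})

try:
    list(set(df["source"]))  # each element is a list, and lists are unhashable
except TypeError as e:
    print(e)  # unhashable type: 'list'
```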
Solution 1:[1]
You can .explode the source column, .drop_duplicates, and .groupby back:
df = (
df.explode("source")
.drop_duplicates(["id", "source"])
.groupby("id", as_index=False)
.agg(list)
)
print(df)
Prints:
   id  source
0   3  [nan]
1   5  [nan, foo]
2   7  [ham, nan]
3   9  [foo]
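For reference, a self-contained version of the above. It assumes the missing entries in the question's lists are np.nan and the other entries are strings (inferred from the printed output); note that drop_duplicates treats NaN values as equal to each other, which is why each list keeps a single nan:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "id": [3, 5, 7, 9],
        "source": [
            [np.nan, np.nan, np.nan],
            [np.nan, "foo", "foo", np.nan, "foo"],
            ["ham", np.nan, "ham", np.nan],
            ["foo", "foo"],
        ],
    }
)

# One row per list element, drop repeats within each id, rebuild the lists
out = (
    df.explode("source")
    .drop_duplicates(["id", "source"])
    .groupby("id", as_index=False)
    .agg(list)
)
print(out)
```

Element order within each list is preserved: explode emits elements in order, drop_duplicates keeps the first occurrence, and groupby keeps the within-group order when aggregating.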
Or convert each list to a pd.Series, drop duplicates, and convert back to a list:
df["source"] = df["source"].apply(lambda x: [*pd.Series(x).drop_duplicates()])
print(df)
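A plain-Python variant of the same per-row idea uses dict.fromkeys, which keeps the first occurrence of each key in insertion order. One caveat, stated here as an assumption about the data: it deduplicates NaN only when every NaN in a list is the same object (e.g. the np.nan singleton), since distinct float('nan') objects compare unequal; the Series round-trip above does not have that caveat.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "id": [3, 5],
        "source": [[np.nan, "foo", "foo", np.nan], ["ham", "ham"]],
    }
)

# dict.fromkeys keeps the first occurrence of each key, preserving order
df["source"] = df["source"].apply(lambda x: list(dict.fromkeys(x)))
print(df)
```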
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Andrej Kesely |
