'Identifying List Difference between Consecutive Rows for Entire Panda DataFrame Column
I have a dataframe of multiple columns and thousands of rows where one of the columns (called Customers) contains list of customer Ids:
| Customers |
|---------------------|
|[A, B, C, D, E, F, G]|
|[A, C, D, E, F, G] |
|[A, C, E, G] |
|[G] |
I am interested to find the element differences between each row and its previous row for the Customers column, so my expected outcome should be:
|Customer Difference |
|---------------------|
|[] |
|[B] |
|[D, F] |
|[A, C, E] |
The Customer column is sorted meaning every row will contain all elements of its below row
I know I can use sets to find two list difference but not sure what would be a pythonic way to do this on the entire column.
Thanks
Solution 1:[1]
Assuming you have a Series of lists, you could use a list comprehension with set operations:
df['diff'] = [sorted(set(b).difference(a)) # the sorting step is optional
for a,b in zip(df['Customers'],
df['Customers'].shift(fill_value=[[]]))
]
output:
Customers diff
0 [A, B, C, D, E, F, G] []
1 [A, C, D, E, F, G] [B]
2 [A, C, E, G] [D, F]
3 [G] [A, C, E]
alternative
You could also convert to set and use diff directly:
df['diff'] = df['Customers'].apply(set).diff(-1).shift(fill_value={})
NB. this approach is fancy but quite slower than the list comprehension
output:
Customers diff
0 [A, B, C, D, E, F, G] {}
1 [A, C, D, E, F, G] {B}
2 [A, C, E, G] {F, D}
3 [G] {E, A, C}
used input:
df = pd.DataFrame({'Customers': [['A', 'B', 'C', 'D', 'E', 'F', 'G'],
['A', 'C', 'D', 'E', 'F', 'G'],
['A', 'C', 'E', 'G'],
['G']]
})
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
