'Identifying List Difference between Consecutive Rows for Entire Panda DataFrame Column

I have a dataframe of multiple columns and thousands of rows where one of the columns (called Customers) contains list of customer Ids:

|      Customers      |    
|---------------------|
|[A, B, C, D, E, F, G]|
|[A, C, D, E, F, G]   |
|[A, C, E, G]         |
|[G]                  |

I am interested to find the element differences between each row and its previous row for the Customers column, so my expected outcome should be:

|Customer Difference  |    
|---------------------|
|[]                   |
|[B]                  |
|[D, F]               |
|[A, C, E]            |

The Customer column is sorted meaning every row will contain all elements of its below row

I know I can use sets to find two list difference but not sure what would be a pythonic way to do this on the entire column.

Thanks

python-3.x pandas

Solution 1:^[1]

Assuming you have a Series of lists, you could use a list comprehension with set operations:

df['diff'] = [sorted(set(b).difference(a))  # the sorting step is optional
              for a,b in zip(df['Customers'],
                             df['Customers'].shift(fill_value=[[]]))
             ]

output:

               Customers       diff
0  [A, B, C, D, E, F, G]         []
1     [A, C, D, E, F, G]        [B]
2           [A, C, E, G]     [D, F]
3                    [G]  [A, C, E]

alternative

You could also convert to set and use diff directly:

df['diff'] = df['Customers'].apply(set).diff(-1).shift(fill_value={})

NB. this approach is fancy but quite slower than the list comprehension

output:

               Customers       diff
0  [A, B, C, D, E, F, G]         {}
1     [A, C, D, E, F, G]        {B}
2           [A, C, E, G]     {F, D}
3                    [G]  {E, A, C}

used input:

df = pd.DataFrame({'Customers': [['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                                 ['A', 'C', 'D', 'E', 'F', 'G'],
                                 ['A', 'C', 'E', 'G'],
                                 ['G']]
                  })

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1

'Identifying List Difference between Consecutive Rows for Entire Panda DataFrame Column

Solution 1:[1]

alternative

Sources

Related Questions

Solution 1:^[1]