Python Pandas DataFrame: select row by max date in group with aggregate

I have a dataframe as follows:


    df = pd.DataFrame({'id': ['A', 'A', 'B', 'B', 'C'],
                       'date': ['2021-01-01T14:54:42.000Z', 
                                '2021-01-01T14:54:42.000Z',
                                '2021-01-01T14:55:42.000Z', 
                                '2021-04-01T15:51:42.000Z',
                                '2021-03-01T15:51:42.000Z'],
                       'foo': ['apple', 'orange', 'apple', 'banana', 'pepper'],
                       'count': [3, 2, 4, 2, 1]})    

I want to group the dataframe by id and date so that foo and count per date are aggregated lists. I then want to take the row with the most recent date per id.

Expected outcome

    id   date                        foo                  count
     A   '2021-01-01T14:54:42.000Z'  ['apple', 'orange']  [3, 2]
     B   '2021-04-01T15:51:42.000Z'  ['banana']           [2]
     C   '2021-03-01T15:51:42.000Z'  ['pepper']           [1]
  

I've tried

    df = df.sort_values(['id', 'date'], ascending=[True, False])
    test_df = df.groupby(['id', 'date'], as_index=False)[['foo', 'count']].agg(list).head(1).reset_index(drop=True)

but this returns only the first row of the whole dataframe, and .first() gives me a TypeError. Any help is greatly appreciated.



Solution 1:[1]

In your case, group by id alone, take the max of date, and collect foo and count into lists:

    df.groupby('id', as_index=False).agg({'date': 'max', 'foo': list, 'count': list})
    Out[178]: 
      id                      date              foo   count
    0  A  2021-01-01T14:54:42.000Z  [apple, orange]  [3, 2]
    1  B  2021-04-01T15:51:42.000Z  [apple, banana]  [4, 2]
    2  C  2021-03-01T15:51:42.000Z         [pepper]     [1]
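
Note that this output collects every row for an id (B shows [apple, banana] and [4, 2]), whereas the expected outcome in the question keeps only the rows that fall on the latest date. A minimal sketch of one way to get that result follows. It is not from the original answer: the names per_date and latest are illustrative, and it assumes the date strings all share the same ISO 8601 format, so that sorting them as plain strings matches chronological order.

```python
import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'B', 'B', 'C'],
                   'date': ['2021-01-01T14:54:42.000Z',
                            '2021-01-01T14:54:42.000Z',
                            '2021-01-01T14:55:42.000Z',
                            '2021-04-01T15:51:42.000Z',
                            '2021-03-01T15:51:42.000Z'],
                   'foo': ['apple', 'orange', 'apple', 'banana', 'pepper'],
                   'count': [3, 2, 4, 2, 1]})

# Step 1: one row per (id, date), with foo and count collected into lists.
per_date = (df.groupby(['id', 'date'], as_index=False)
              .agg({'foo': list, 'count': list}))

# Step 2: within each id, keep only the row with the latest date.
# Assumption: the date strings are uniformly formatted, so string order
# equals chronological order. Otherwise parse them with pd.to_datetime first.
latest = (per_date.sort_values(['id', 'date'])
                  .drop_duplicates(subset='id', keep='last')
                  .reset_index(drop=True))

print(latest)
```

Aggregating before filtering is what keeps both A rows together: they share the same timestamp, so they end up in a single (id, date) group.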

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1