Python Pandas DataFrame: select row by max date in group with aggregate
I have a dataframe as follows:
import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'B', 'B', 'C'],
                   'date': ['2021-01-01T14:54:42.000Z',
                            '2021-01-01T14:54:42.000Z',
                            '2021-01-01T14:55:42.000Z',
                            '2021-04-01T15:51:42.000Z',
                            '2021-03-01T15:51:42.000Z'],
                   'foo': ['apple', 'orange', 'apple', 'banana', 'pepper'],
                   'count': [3, 2, 4, 2, 1]})
I want to group the dataframe by id and date so that foo and count per date are aggregated lists. I then want to take the row with the most recent date per id.
Expected outcome
id date foo count
A '2021-01-01T14:54:42.000Z' ['apple', 'orange'] [3, 2]
B '2021-04-01T15:51:42.000Z' ['banana'] [2]
C '2021-03-01T15:51:42.000Z' ['pepper'] [1]
I've tried
df = df.sort_values(['id', 'date'], ascending=(True, False))
test_df = df.groupby(['id', 'date'], as_index=False)['foo', 'count'].agg(list).head(1).reset_index(drop=True)
but this only gives me the first row of the DataFrame. Using .first() instead gives me a TypeError. Any help is greatly appreciated.
Solution 1:[1]
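In your case, you can group by id alone, take the maximum date, and collect foo and count into lists. Note that the lists then contain the values from all rows of each id, not only the rows on the latest date, so B yields [apple, banana] and [4, 2] rather than [banana] and [2]: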
df.groupby('id',as_index=False).agg({'date':'max','foo':list,'count':list})
Out[178]:
id date foo count
0 A 2021-01-01T14:54:42.000Z [apple, orange] [3, 2]
1 B 2021-04-01T15:51:42.000Z [apple, banana] [4, 2]
2 C 2021-03-01T15:51:42.000Z [pepper] [1]
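If the lists should contain only the values from the most recent date per id, as in the expected outcome of the question, one possible approach is to filter the rows first and aggregate afterwards. The following is a minimal sketch, not part of the original answer. It assumes that all timestamps share the same ISO 8601 format, so that comparing them as strings matches chronological order; the names `latest_date`, `latest_rows` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'B', 'B', 'C'],
                   'date': ['2021-01-01T14:54:42.000Z',
                            '2021-01-01T14:54:42.000Z',
                            '2021-01-01T14:55:42.000Z',
                            '2021-04-01T15:51:42.000Z',
                            '2021-03-01T15:51:42.000Z'],
                   'foo': ['apple', 'orange', 'apple', 'banana', 'pepper'],
                   'count': [3, 2, 4, 2, 1]})

# Latest date per id, broadcast back to every row of that id.
# Assumption: the date strings sort chronologically as plain strings.
latest_date = df.groupby('id')['date'].transform('max')

# Keep only the rows that fall on the latest date of their id.
latest_rows = df[df['date'] == latest_date]

# Collect foo and count into lists for each (id, date) pair.
result = (latest_rows
          .groupby(['id', 'date'], as_index=False)
          .agg({'foo': list, 'count': list}))

print(result)
```

Rows that share the latest date within an id (such as the two rows for A) are all kept by the filter, so their values end up in the same list. If the timestamps came in mixed formats, converting the column with `pd.to_datetime` before comparing would be the safer choice.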
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
