'How do you return dataframes with lowest date for each ID and highest date for each ID
I have a dataframe made up of synthetic student test results data, this looks like the below:
print(df)
student_ID test_date result
76 2021-02-14 60
33 2021-01-12 54
76 2021-11-23 71
76 2021-05-10 78
33 2021-06-09 81
...
The output I'm looking for would look like the below, the oldest and most recent test date and result for each student ID, along with the difference between the two.:
student_ID test_date result test_date2 result2 difference
76 2021-02-14 60 2021-11-23 71 11
33 2021-01-12 54 2021-06-09 81 27
...
I was thinking to create two seperate dataframes, one with records that have the oldest date for each student ID and the other dataframe with the most recent record for each student ID, then concat the two and create an additional column to calculate the difference but I'm unsure if this would the the correct way of doing it. Would there also be a way to order the records by highest difference to lowest, regardless of whether it is a positive or negative difference (10 and -10 would be the same).
Solution 1:[1]
You can use DataFrameGroupBy.idxmax and DataFrameGroupBy.idxmin for get values of index from column result for values by maximal and minimal datetimes with aggregate max and min in GroupBy.agg, last create difference column:
df1 = (df.set_index('result')
.groupby('student_ID', sort=False)
.agg(test_date=('test_date','min'),
result=('test_date','idxmin'),
test_date2=('test_date','max'),
result2=('test_date','idxmax'))
.assign(difference = lambda x: x['result2'].sub(x['result']))
.reset_index())
print (df1)
student_ID test_date result test_date2 result2 difference
0 76 2021-02-14 60 2021-11-23 71 11
1 33 2021-01-12 54 2021-06-09 81 27
For oldier pandas version here is alternative solution:
df1 = (df.set_index('result')
.groupby('student_ID', sort=False)['test_date']
.agg([('test_date','min'),
('result','idxmin'),
('test_date2','max'),
('result2','idxmax')])
.assign(difference = lambda x: x['result2'].sub(x['result']))
.reset_index())
print (df1)
student_ID test_date result test_date2 result2 difference
0 76 2021-02-14 60 2021-11-23 71 11
1 33 2021-01-12 54 2021-06-09 81 27
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
