'How do you return dataframes with lowest date for each ID and highest date for each ID

I have a dataframe made up of synthetic student test results data, this looks like the below:

print(df)

student_ID     test_date    result
76             2021-02-14   60
33             2021-01-12   54
76             2021-11-23   71
76             2021-05-10   78
33             2021-06-09   81
...

The output I'm looking for would look like the below, the oldest and most recent test date and result for each student ID, along with the difference between the two.:

student_ID     test_date     result     test_date2     result2     difference
76             2021-02-14    60         2021-11-23     71          11
33             2021-01-12    54         2021-06-09     81          27
...

I was thinking to create two seperate dataframes, one with records that have the oldest date for each student ID and the other dataframe with the most recent record for each student ID, then concat the two and create an additional column to calculate the difference but I'm unsure if this would the the correct way of doing it. Would there also be a way to order the records by highest difference to lowest, regardless of whether it is a positive or negative difference (10 and -10 would be the same).



Solution 1:[1]

You can use DataFrameGroupBy.idxmax and DataFrameGroupBy.idxmin for get values of index from column result for values by maximal and minimal datetimes with aggregate max and min in GroupBy.agg, last create difference column:

df1 = (df.set_index('result')
         .groupby('student_ID', sort=False)
         .agg(test_date=('test_date','min'),
              result=('test_date','idxmin'),
              test_date2=('test_date','max'),
              result2=('test_date','idxmax'))
         .assign(difference = lambda x: x['result2'].sub(x['result']))
         .reset_index())
print (df1)
   student_ID  test_date  result test_date2  result2  difference
0          76 2021-02-14      60 2021-11-23       71          11
1          33 2021-01-12      54 2021-06-09       81          27

For oldier pandas version here is alternative solution:

df1 = (df.set_index('result')
         .groupby('student_ID', sort=False)['test_date']
         .agg([('test_date','min'),
               ('result','idxmin'),
               ('test_date2','max'),
               ('result2','idxmax')])
         .assign(difference = lambda x: x['result2'].sub(x['result']))
         .reset_index())
print (df1)
   student_ID  test_date  result test_date2  result2  difference
0          76 2021-02-14      60 2021-11-23       71          11
1          33 2021-01-12      54 2021-06-09       81          27

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1