'I would like to calculate the difference between two floats datatypes and hours within respective columns

I would like to calculate the difference between enrollment fees that were paid within 24 hours and is less than 10000 USD each but when combined is greater than 10000 USD that was paid by specific/unqiue students

See below for columns/data dictionary in my dataset for clarity

  • Student ID (string dataype): This is a unique ID that is specific to each student, for students that enrolled in more than more course, they will appear in multiply rows under the same ID
  • Enrollment Date (datetime datatype) : The is the date of enrollment for different courses
  • Enrollment Time (datetime dataype) : This is the time of enrollment for the different courses
  • Enrollment Fees (Float datatype): This is amount paid to enroll in a course

I firstly want to identify students that enrolled in more than one course using the count method as in my notebook but I don't know where to go from here.



Solution 1:[1]

I copied your 'student_information' table and added some rows for the sake of this answer.

import pandas as pd
#Converted your table from the image you provide to excel.
student_information = pd.read_excel('/content/student_data_notebook picture.xlsx', 'Sheet1')
student_information.head(10)
index enrollment_date enrollment_time_number enrolment_fees student_id enrollment_time total_fee_difference_24_hours total_fee_difference_24_hours_over_10000
0 2019-08-01 00:00:00 153443 28000 2286141 1970-01-02 18:37:23 true true
1 2019-08-01 00:00:00 153443 10000 2456977 1970-01-02 18:37:23 false false
2 2019-08-01 00:00:00 132351 23690 3182054 1970-01-02 12:45:51 false false
3 2019-08-02 00:00:00 154452 20000 3193091 1970-01-02 18:54:12 false false
4 2019-08-06 00:00:00 134746 33436 2911297 1970-01-02 13:25:46 false false
5 2019-08-01 00:00:00 153443 18000 2286141 1970-01-02 21:37:23 true true
6 2019-08-03 00:00:00 153443 31000 2286141 1970-01-02 23:37:23 true true
7 2019-08-04 00:00:00 153443 45000 2286141 1970-01-02 13:37:23 true true
8 2019-08-01 00:00:00 153443 10000 2456977 1970-01-05 18:37:23 false false
9 2019-08-02 00:00:00 153443 25000 2456977 1970-01-08 18:38:23 false false
#Checked the data type of each column.
student_information.dtypes

enrollment_date datetime64[ns] enrollment_time_number int64 enrolment_fees int64 student_id int64 enrollment_time datetime64[ns] dtype: object

#Created a summary table to add to the number of enrolled courses for each student.
student_summary = student_information.groupby('student_id')['enrolment_fees'].count().reset_index()
student_summary.rename(columns={'enrolment_fees': 'number_of_enrolled_courses'}, inplace = True)
student_summary
index student_id number_of_enrolled_courses
0 2286141 4
1 2456977 3
2 2911297 1
3 3182054 1
4 3193091 1
#Added the total fees paid for each student id and a column to check if its over 100000.
student_summary['total_fees'] = student_information.groupby('student_id')['enrolment_fees'].sum().reset_index()['enrolment_fees']
student_summary['over_100000_in_fees'] = student_summary['total_fees'] > 100000
student_summary
index student_id number_of_enrolled_courses total_fees over_100000_in_fees
0 2286141 4 122000 true
1 2456977 3 45000 false
2 2911297 1 33436 false
3 3182054 1 23690 false
4 3193091 1 20000 false
'''
A column is added to the 'student_information' table to represent the difference in fees within 24 hours for each student, relative to each course they are enrolled in and if the sum is smaller than 10000.
''' 
length_df = student_information.shape[0]
list_total_fee_difference = []
for row in range(length_df):
    student_id = student_information.iloc[row]['student_id']
    enrollment_time = student_information.iloc[row]['enrollment_time']
    enrollment_fees = student_information.iloc[row]['enrolment_fees']
    low_time_limit = enrollment_time - pd.Timedelta('1 Day')
    upper_time_limit = enrollment_time + pd.Timedelta('1 Day')
    total_fee_difference = 0
    for row_2 in range(length_df):
        enrollment_time_2 = student_information.iloc[row_2]['enrollment_time']
        student_id_2 = enrollment_time = student_information.iloc[row_2]['student_id']
        enrollment_fees_2 = student_information.iloc[row_2]['enrolment_fees']
        if ((student_id == student_id_2) and 
            (enrollment_time_2 >= low_time_limit) and 
            (upper_time_limit >= enrollment_time_2)):
            total_fee_difference += abs(enrollment_fees - enrollment_fees_2)
    list_total_fee_difference.append(total_fee_difference)
series_total_fee_difference = pd.Series(list_total_fee_difference)
student_information['total_fee_difference_24_hours_over_10000'] = series_total_fee_difference.values > 10000
student_information
index enrollment_date enrollment_time_number enrolment_fees student_id enrollment_time total_fee_difference_24_hours_under_10000
0 2019-08-01 00:00:00 153443 28000 2286141 1970-01-02 18:37:23 false
1 2019-08-01 00:00:00 153443 10000 2456977 1970-01-02 18:37:23 true
2 2019-08-01 00:00:00 132351 23690 3182054 1970-01-02 12:45:51 true
3 2019-08-02 00:00:00 154452 20000 3193091 1970-01-02 18:54:12 true
4 2019-08-06 00:00:00 134746 33436 2911297 1970-01-02 13:25:46 true
5 2019-08-01 00:00:00 153443 18000 2286141 1970-01-02 21:37:23 false
6 2019-08-03 00:00:00 153443 31000 2286141 1970-01-02 23:37:23 false
7 2019-08-04 00:00:00 153443 45000 2286141 1970-01-02 13:37:23 false
8 2019-08-01 00:00:00 153443 10000 2456977 1970-01-05 18:37:23 true
9 2019-08-02 00:00:00 153443 25000 2456977 1970-01-08 18:38:23 true

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ahmedopolis