'I would like to calculate the difference between two floats datatypes and hours within respective columns
I would like to calculate the difference between enrollment fees that were paid within 24 hours and is less than 10000 USD each but when combined is greater than 10000 USD that was paid by specific/unqiue students
See below for columns/data dictionary in my dataset for clarity
- Student ID (string dataype): This is a unique ID that is specific to each student, for students that enrolled in more than more course, they will appear in multiply rows under the same ID
- Enrollment Date (datetime datatype) : The is the date of enrollment for different courses
- Enrollment Time (datetime dataype) : This is the time of enrollment for the different courses
- Enrollment Fees (Float datatype): This is amount paid to enroll in a course
I firstly want to identify students that enrolled in more than one course using the count method as in my notebook but I don't know where to go from here.
Solution 1:[1]
I copied your 'student_information' table and added some rows for the sake of this answer.
import pandas as pd
#Converted your table from the image you provide to excel.
student_information = pd.read_excel('/content/student_data_notebook picture.xlsx', 'Sheet1')
student_information.head(10)
| index | enrollment_date | enrollment_time_number | enrolment_fees | student_id | enrollment_time | total_fee_difference_24_hours | total_fee_difference_24_hours_over_10000 |
|---|---|---|---|---|---|---|---|
| 0 | 2019-08-01 00:00:00 | 153443 | 28000 | 2286141 | 1970-01-02 18:37:23 | true | true |
| 1 | 2019-08-01 00:00:00 | 153443 | 10000 | 2456977 | 1970-01-02 18:37:23 | false | false |
| 2 | 2019-08-01 00:00:00 | 132351 | 23690 | 3182054 | 1970-01-02 12:45:51 | false | false |
| 3 | 2019-08-02 00:00:00 | 154452 | 20000 | 3193091 | 1970-01-02 18:54:12 | false | false |
| 4 | 2019-08-06 00:00:00 | 134746 | 33436 | 2911297 | 1970-01-02 13:25:46 | false | false |
| 5 | 2019-08-01 00:00:00 | 153443 | 18000 | 2286141 | 1970-01-02 21:37:23 | true | true |
| 6 | 2019-08-03 00:00:00 | 153443 | 31000 | 2286141 | 1970-01-02 23:37:23 | true | true |
| 7 | 2019-08-04 00:00:00 | 153443 | 45000 | 2286141 | 1970-01-02 13:37:23 | true | true |
| 8 | 2019-08-01 00:00:00 | 153443 | 10000 | 2456977 | 1970-01-05 18:37:23 | false | false |
| 9 | 2019-08-02 00:00:00 | 153443 | 25000 | 2456977 | 1970-01-08 18:38:23 | false | false |
#Checked the data type of each column.
student_information.dtypes
enrollment_date datetime64[ns] enrollment_time_number int64 enrolment_fees int64 student_id int64 enrollment_time datetime64[ns] dtype: object
#Created a summary table to add to the number of enrolled courses for each student.
student_summary = student_information.groupby('student_id')['enrolment_fees'].count().reset_index()
student_summary.rename(columns={'enrolment_fees': 'number_of_enrolled_courses'}, inplace = True)
student_summary
| index | student_id | number_of_enrolled_courses |
|---|---|---|
| 0 | 2286141 | 4 |
| 1 | 2456977 | 3 |
| 2 | 2911297 | 1 |
| 3 | 3182054 | 1 |
| 4 | 3193091 | 1 |
#Added the total fees paid for each student id and a column to check if its over 100000.
student_summary['total_fees'] = student_information.groupby('student_id')['enrolment_fees'].sum().reset_index()['enrolment_fees']
student_summary['over_100000_in_fees'] = student_summary['total_fees'] > 100000
student_summary
| index | student_id | number_of_enrolled_courses | total_fees | over_100000_in_fees |
|---|---|---|---|---|
| 0 | 2286141 | 4 | 122000 | true |
| 1 | 2456977 | 3 | 45000 | false |
| 2 | 2911297 | 1 | 33436 | false |
| 3 | 3182054 | 1 | 23690 | false |
| 4 | 3193091 | 1 | 20000 | false |
'''
A column is added to the 'student_information' table to represent the difference in fees within 24 hours for each student, relative to each course they are enrolled in and if the sum is smaller than 10000.
'''
length_df = student_information.shape[0]
list_total_fee_difference = []
for row in range(length_df):
student_id = student_information.iloc[row]['student_id']
enrollment_time = student_information.iloc[row]['enrollment_time']
enrollment_fees = student_information.iloc[row]['enrolment_fees']
low_time_limit = enrollment_time - pd.Timedelta('1 Day')
upper_time_limit = enrollment_time + pd.Timedelta('1 Day')
total_fee_difference = 0
for row_2 in range(length_df):
enrollment_time_2 = student_information.iloc[row_2]['enrollment_time']
student_id_2 = enrollment_time = student_information.iloc[row_2]['student_id']
enrollment_fees_2 = student_information.iloc[row_2]['enrolment_fees']
if ((student_id == student_id_2) and
(enrollment_time_2 >= low_time_limit) and
(upper_time_limit >= enrollment_time_2)):
total_fee_difference += abs(enrollment_fees - enrollment_fees_2)
list_total_fee_difference.append(total_fee_difference)
series_total_fee_difference = pd.Series(list_total_fee_difference)
student_information['total_fee_difference_24_hours_over_10000'] = series_total_fee_difference.values > 10000
student_information
| index | enrollment_date | enrollment_time_number | enrolment_fees | student_id | enrollment_time | total_fee_difference_24_hours_under_10000 |
|---|---|---|---|---|---|---|
| 0 | 2019-08-01 00:00:00 | 153443 | 28000 | 2286141 | 1970-01-02 18:37:23 | false |
| 1 | 2019-08-01 00:00:00 | 153443 | 10000 | 2456977 | 1970-01-02 18:37:23 | true |
| 2 | 2019-08-01 00:00:00 | 132351 | 23690 | 3182054 | 1970-01-02 12:45:51 | true |
| 3 | 2019-08-02 00:00:00 | 154452 | 20000 | 3193091 | 1970-01-02 18:54:12 | true |
| 4 | 2019-08-06 00:00:00 | 134746 | 33436 | 2911297 | 1970-01-02 13:25:46 | true |
| 5 | 2019-08-01 00:00:00 | 153443 | 18000 | 2286141 | 1970-01-02 21:37:23 | false |
| 6 | 2019-08-03 00:00:00 | 153443 | 31000 | 2286141 | 1970-01-02 23:37:23 | false |
| 7 | 2019-08-04 00:00:00 | 153443 | 45000 | 2286141 | 1970-01-02 13:37:23 | false |
| 8 | 2019-08-01 00:00:00 | 153443 | 10000 | 2456977 | 1970-01-05 18:37:23 | true |
| 9 | 2019-08-02 00:00:00 | 153443 | 25000 | 2456977 | 1970-01-08 18:38:23 | true |
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Ahmedopolis |

