How to calculate future values based on the current value of a column in a PySpark dataframe?
My question is as follows. I'm trying to calculate a future value, say a backlog value, in a PySpark dataframe.
My sample dataframe is:
Task start_date end_date Total_salary
Task1 2022-01-01 2022-04-01 500
Task2 2022-03-01 2022-06-01 400
Task3 2019-11-01 2020-01-01 300
Task4 2021-11-01 2022-04-01 600
Expected output: I need to calculate the backlog from this month up to the maximum date in the end_date column. The pay for one month is: Total_salary / months between start_date and end_date. I need the output below, starting from Jan 2022, in a separate dataframe that has only the following two columns.
date Total_backlog
2022-01-31 1100 #(Task1: 500-100) + (Task2: 400, because it
  #hasn't started yet) + (Task3: 0, because it's
  #already finished) + (Task4: 600-300)
  #So the total is: 400 + 400 + 0 + 300 = 1100
2022-02-28 800
2022-03-31 .....
.......
2022-06-30 .....
(2022-06-01 is the maximum date in end_date for this sample; in the actual data set the maximum date is later than that.)
I don't know how to loop over a PySpark dataframe. Can someone please help me?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
