How to calculate future values based on the current value in a column of a PySpark DataFrame?

My question is as follows. I'm trying to calculate a future value, in this case a backlog value, in a PySpark DataFrame.

My sample data frame is:

Task      start_date      end_date    Total_salary
Task1     2022-01-01    2022-04-01             500
Task2     2022-03-01    2022-06-01             400
Task3     2019-11-01    2020-01-01             300
Task4     2021-11-01    2022-04-01             600

Expected output: I need to calculate the backlog from this month up to the maximum date in the end_date column. The pay for one month of a task is Total_salary divided by the number of months between start_date and end_date. I need the output below, starting from Jan 2022, in a separate data frame that has only the two columns shown.

date              Total_backlog
2022-01-31        #(Task1: 500 - 100) + (Task2: 400, because it hasn't
                  #started yet) + (Task3: 0, it is already finished) +
                  #(Task4: 600 - 300)
                  #So the total is: 400 + 400 + 0 + 300 = 1100

2022-02-28        800

2022-03-31        .....

.......
2022-06-30        .....

(This is the maximum date in the end_date column of the sample; in the actual data set the maximum end_date is later.)

I don't know how to loop over a PySpark DataFrame. Can someone please help me?
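Before porting this to PySpark, the per-month arithmetic described above can be sketched in plain Python. This is a minimal sketch under an assumed convention (whole calendar months between start_date and end_date, with the month containing the report date counted as already paid); the hand-computed figures in the question are ambiguous about the exact month counting, so the helper names (`months_between`, `backlog_at`) and the convention here are illustrative, not definitive, and the resulting totals may differ from the ones worked out by hand above.

```python
from datetime import date

def months_between(d1, d2):
    # Whole calendar months from d1 to d2 (ignores day-of-month).
    return (d2.year - d1.year) * 12 + (d2.month - d1.month)

def backlog_at(tasks, as_of):
    # tasks: iterable of (start_date, end_date, total_salary).
    # A task that has not started yet still owes its full salary; a
    # finished task owes nothing; an in-flight task owes the salary
    # for the months remaining after `as_of`.
    total = 0.0
    for start, end, salary in tasks:
        duration = months_between(start, end)
        elapsed = months_between(start, as_of) + 1  # month of `as_of` counts as paid
        if elapsed <= 0:
            total += salary            # not started yet: full salary outstanding
        elif elapsed >= duration:
            pass                       # already finished: backlog is 0
        else:
            monthly = salary / duration
            total += salary - monthly * elapsed
    return total

tasks = [
    (date(2022, 1, 1), date(2022, 4, 1), 500),   # Task1
    (date(2022, 3, 1), date(2022, 6, 1), 400),   # Task2
    (date(2019, 11, 1), date(2020, 1, 1), 300),  # Task3
    (date(2021, 11, 1), date(2022, 4, 1), 600),  # Task4
]
print(backlog_at(tasks, date(2022, 1, 31)))
```

In Spark itself, the usual shape of this solution is to build a one-column DataFrame of month-end dates (for example with `sequence` and `explode`), cross join it with the tasks, compute the per-task remaining amount with the logic above expressed as column expressions, and group by the date, rather than looping row by row.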



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
