Effective way to iterate and calculate based on dates and ID conditions in a large dataset using PySpark/Databricks

I don't have experience with PySpark, so I'd appreciate help with the following issue:

I have the following Spark dataframe:

| ID_user | date_contract | total_debt |
|---------|---------------|------------|
| 2564541 | 2015-08-22    | 64         |
| 2564541 | 2019-05-16    | 100        |
| 2564541 | 2020-08-15    | 200        |
| 2564541 | 2021-09-17    | 150        |
| 3000000 | 2014-08-22    | 84         |
| 3000000 | 2015-08-23    | 100        |
| 3000000 | 2016-08-24    | 200        |

As you can see above, a user can have more than one contract, and each user can sign at most one contract per day. The goal is to calculate, for each user (ID_user) and each contract, the average total_debt over all of that user's contracts agreed before the current one.

For example, user 2564541 signed a contract on 2021-09-17, so the average debt over all of his contracts agreed before that one (in 2015, 2019, and 2020) is mean(64, 100, 200) = 121.33.

I think some kind of iteration is needed, because the same calculation has to be repeated for each ID_user and each contract date.

Expected output:

| ID_user | date_contract | total_debt | avg_debt_before |
|---------|---------------|------------|-----------------|
| 2564541 | 2015-08-22    | 64         | 0               |
| 2564541 | 2019-05-16    | 100        | 64              |
| 2564541 | 2020-08-15    | 200        | 82              |
| 2564541 | 2021-09-17    | 150        | 121.33          |
| 3000000 | 2014-08-22    | 84         | 0               |
| 3000000 | 2015-08-23    | 100        | 84              |
| 3000000 | 2016-08-24    | 200        | 92              |

I've already tried rdd.toLocalIterator(), but with no success. I've also spent almost the whole week searching for an answer here, so any help or tips would be great!
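
For reference, this can be done without explicit iteration by using a window function: partition by ID_user, order by date_contract, and average total_debt over a frame that ends one row before the current row. A minimal sketch of that approach (the coalesce to 0 for a user's first contract is an assumption taken from the expected output above):

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Sample data matching the dataframe above.
df = spark.createDataFrame(
    [
        (2564541, "2015-08-22", 64),
        (2564541, "2019-05-16", 100),
        (2564541, "2020-08-15", 200),
        (2564541, "2021-09-17", 150),
        (3000000, "2014-08-22", 84),
        (3000000, "2015-08-23", 100),
        (3000000, "2016-08-24", 200),
    ],
    ["ID_user", "date_contract", "total_debt"],
).withColumn("date_contract", F.to_date("date_contract"))

# One frame per user, ordered by contract date, covering every row
# strictly before the current one (unboundedPreceding .. -1).
w = (
    Window.partitionBy("ID_user")
    .orderBy("date_contract")
    .rowsBetween(Window.unboundedPreceding, -1)
)

result = df.withColumn(
    "avg_debt_before",
    # avg() over an empty frame is null for a user's first contract;
    # coalesce to 0 to match the expected output.
    F.coalesce(F.avg("total_debt").over(w), F.lit(0.0)),
)

result.orderBy("ID_user", "date_contract").show()
```

Unlike rdd.toLocalIterator(), which pulls rows back to the driver, the window aggregate runs distributed across the executors, so it should scale to a large dataset.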


