'Pyspark DataFrame Grouping by item that doesn't belong to the group

I am new to pyspark and am stuck in a situation could you please help me in obtaining a result in a manner as:

customer_id item_id amount
1 tx1 15
1 tx2 10
1 tx3 14
2 tx1 15
2 tx4 12
3 tx2 10
2 tx6 43
4 tx4 12
5 tx8 76
6 tx6 43
5 tx6 43
3 tx6 43

And want to know for each item:

  • The count of customers that didn't purchase this item
  • The sum of the amount of items that are not the customers of this item.

So the final table would look like:

item_id target_cust taget_amount
tx1 4 227
tx2 4 201
tx3 5 297
tx4 4 --
tx6 3 --
tx8 5 --

please help me in getting a similar output, any suggestions in the direction would be great



Solution 1:[1]

First group by customer_id and get list of purchased item_id with the associated amount like this:

import pyspark.sql.functions as F

items_by_customer_df = df.groupBy("customer_id").agg(
    F.collect_set("item_id").alias("items"),
    F.sum("amount").alias("target_amount")

)

items_by_customer_df.show()
#+-----------+---------------+-------------+
#|customer_id|items          |target_amount|
#+-----------+---------------+-------------+
#|1          |[tx1, tx2, tx3]|39           |
#|2          |[tx1, tx6, tx4]|70           |
#|3          |[tx2, tx6]     |53           |
#|5          |[tx6, tx8]     |119          |
#|4          |[tx4]          |12           |
#|6          |[tx6]          |43           |
#+-----------+---------------+-------------+

Now, join this grouped dataframe with distinct item_ids from original df using negation of array_contains as condition, then group by item_id and do aggregations count(customer_id) + sum(amount):

result = df.select("item_id").distinct().join(
    items_by_customer_df,
    ~F.array_contains("items", F.col("item_id"))
).groupBy("item_id").agg(
    F.count("customer_id").alias("target_cust"),
    F.sum("target_amount").alias("target_amount")
)

result.show()
#+-------+-----------+-------------+
#|item_id|target_cust|target_amount|
#+-------+-----------+-------------+
#|    tx2|          4|          244|
#|    tx4|          4|          254|
#|    tx1|          4|          227|
#|    tx8|          5|          217|
#|    tx3|          5|          297|
#|    tx6|          2|           51|
#+-------+-----------+-------------+

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 blackbishop