'Pyspark DataFrame Grouping by item that doesn't belong to the group
I am new to pyspark and am stuck in a situation could you please help me in obtaining a result in a manner as:
| customer_id | item_id | amount |
|---|---|---|
| 1 | tx1 | 15 |
| 1 | tx2 | 10 |
| 1 | tx3 | 14 |
| 2 | tx1 | 15 |
| 2 | tx4 | 12 |
| 3 | tx2 | 10 |
| 2 | tx6 | 43 |
| 4 | tx4 | 12 |
| 5 | tx8 | 76 |
| 6 | tx6 | 43 |
| 5 | tx6 | 43 |
| 3 | tx6 | 43 |
And want to know for each item:
- The count of
customersthat didn't purchase thisitem - The sum of the
amountofitemsthat are not thecustomersof thisitem.
So the final table would look like:
| item_id | target_cust | taget_amount |
|---|---|---|
| tx1 | 4 | 227 |
| tx2 | 4 | 201 |
| tx3 | 5 | 297 |
| tx4 | 4 | -- |
| tx6 | 3 | -- |
| tx8 | 5 | -- |
please help me in getting a similar output, any suggestions in the direction would be great
Solution 1:[1]
First group by customer_id and get list of purchased item_id with the associated amount like this:
import pyspark.sql.functions as F
items_by_customer_df = df.groupBy("customer_id").agg(
F.collect_set("item_id").alias("items"),
F.sum("amount").alias("target_amount")
)
items_by_customer_df.show()
#+-----------+---------------+-------------+
#|customer_id|items |target_amount|
#+-----------+---------------+-------------+
#|1 |[tx1, tx2, tx3]|39 |
#|2 |[tx1, tx6, tx4]|70 |
#|3 |[tx2, tx6] |53 |
#|5 |[tx6, tx8] |119 |
#|4 |[tx4] |12 |
#|6 |[tx6] |43 |
#+-----------+---------------+-------------+
Now, join this grouped dataframe with distinct item_ids from original df using negation of array_contains as condition, then group by item_id and do aggregations count(customer_id) + sum(amount):
result = df.select("item_id").distinct().join(
items_by_customer_df,
~F.array_contains("items", F.col("item_id"))
).groupBy("item_id").agg(
F.count("customer_id").alias("target_cust"),
F.sum("target_amount").alias("target_amount")
)
result.show()
#+-------+-----------+-------------+
#|item_id|target_cust|target_amount|
#+-------+-----------+-------------+
#| tx2| 4| 244|
#| tx4| 4| 254|
#| tx1| 4| 227|
#| tx8| 5| 217|
#| tx3| 5| 297|
#| tx6| 2| 51|
#+-------+-----------+-------------+
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | blackbishop |
